# Focused Learning: LLM-based Noise Classification in Code Reviews

## Learning Objectives

This notebook explores how to use Large Language Models (LLMs) to classify code review comments as "valid" or "noisy". This classification step, described in Section IV of the paper "Too Noisy To Learn", is what enables semantic data cleaning.

**What you'll learn:**
1. How to design effective prompts for classification tasks
2. The impact of different prompt strategies on classification performance
3. Why simpler prompts often outperform complex ones
4. How to evaluate classification performance with appropriate metrics

**Paper Reference**: Section IV - Semantic Data Cleaning via LLMs (RQ1)

## 1. Theoretical Foundation

### 1.1 The Classification Problem

The paper identifies a critical challenge in automated code review: training datasets contain significant noise (32-36% noisy comments). This noise includes:

- **Vague questions**: "Why this change?"
- **Non-actionable feedback**: "I don't understand this"
- **Clarification requests**: "What does this do?"

### 1.2 Valid vs Noisy Comments

**Valid Comments** (Definition from Section III):
- Provide clear suggestions for improvement
- Explicitly express issues
- Outline necessary actions
- Have clear action types (refactoring, testing, bug fixes)

**Noisy Comments**:
- Do not request direct actions
- Are unclear or difficult to understand
- Merely justify code changes
- Are vague or ambiguous

### 1.3 Why LLMs for Classification?

Traditional heuristic approaches (keyword matching, sentence length) fail because they:
1. Lack semantic understanding
2. Cannot handle nuanced language
3. Miss context-dependent meanings

LLMs offer:
- Deep semantic understanding
- Context awareness
- Cross-language generalizability
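
To make the limitation of heuristics concrete, here is a minimal sketch of a keyword and length baseline. The function `heuristic_label`, its `min_words` parameter, and the question-word pattern are hypothetical and not taken from the paper; the three comments and their expected labels come from the test dataset used later in this notebook.

```python
import re

# Hypothetical heuristic baseline (an assumption, not the paper's method):
# treat very short comments and questions as noisy, everything else as valid.
QUESTION_START = re.compile(r"^(why|what|how)\b", re.IGNORECASE)

def heuristic_label(comment: str, min_words: int = 4) -> str:
    text = comment.strip()
    if len(text.split()) < min_words:
        return "noisy"
    if text.endswith("?") or QUESTION_START.match(text):
        return "noisy"
    return "valid"

examples = [
    ("Why do we have this flag?", "noisy"),
    ("Please rename $prev_ref to $previousRef", "valid"),
    ("Shouldn't this be an assert instead of a throw?", "valid"),
]
for comment, expected in examples:
    print(f"{heuristic_label(comment):5s} (expected {expected}): {comment}")
```

The third comment is phrased as a question but carries a concrete suggestion, so the surface-level rule mislabels it as noisy. Telling these cases apart requires reading the comment's intent, which is the gap the LLM classifier is meant to fill.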

## 2. Environment Setup

In [None]:
import json
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

# LangChain imports
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser
# Import from pydantic directly; langchain_core.pydantic_v1 is deprecated in recent langchain-core releases
from pydantic import BaseModel, Field

# Metrics
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

## 3. Understanding Classification Output Schema

The paper uses structured output from LLMs to ensure consistent classification results.

In [None]:
class CommentClassification(BaseModel):
    """Schema for LLM classification output
    
    This ensures the LLM returns structured data that we can reliably parse.
    """
    label: str = Field(
        description="Classification result: must be exactly 'valid' or 'noisy'"
    )
    explanation: str = Field(
        description="Detailed reasoning for the classification decision"
    )
    confidence: float = Field(
        description="Confidence score between 0.0 and 1.0",
        ge=0.0,
        le=1.0
    )

# Example of expected output
example_output = CommentClassification(
    label="valid",
    explanation="The comment provides a specific suggestion to rename a variable for better readability",
    confidence=0.85
)

print("Example Classification Output:")
print(json.dumps(example_output.dict(), indent=2))
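
Note that the schema above types `label` as a plain `str`, so the "must be exactly 'valid' or 'noisy'" requirement lives only in the field description. The sketch below shows one way to enforce it after the fact using only the standard library. The helper `parse_classification` and the `ALLOWED_LABELS` set are illustrative names, not part of the paper or of LangChain.

```python
import json

ALLOWED_LABELS = {"valid", "noisy"}

def parse_classification(raw: str) -> dict:
    """Parse a raw JSON reply and check it against the three-field schema."""
    data = json.loads(raw)
    # Normalize case and whitespace before checking the label
    label = str(data["label"]).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {data['label']!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {
        "label": label,
        "explanation": str(data.get("explanation", "")),
        "confidence": confidence,
    }

# A made-up reply with a capitalized label, as a model might return
reply = '{"label": "Valid", "explanation": "Suggests a rename.", "confidence": 0.85}'
print(parse_classification(reply))
```

Rejecting unexpected labels explicitly avoids a subtle failure mode: any label other than the exact string `valid` would otherwise be counted as noisy when the metrics are computed.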

## 4. Prompt Engineering Deep Dive

The paper tests 4 prompt configurations:
1. P_DEFINITION with RNL (review comment only)
2. P_DEFINITION with RNL+CDIFF (comment + code diff)
3. P_AUXILIARY with RNL
4. P_AUXILIARY with RNL+CDIFF

In [None]:
class PromptDesigner:
    """Implements different prompt strategies from the paper"""
    
    @staticmethod
    def get_definitions() -> Tuple[str, str]:
        """Get valid and noisy comment definitions from Section III"""
        
        valid_def = """Valid comments are review comments that provide clear suggestions aimed at improving 
the source code. Given the submitted code change (code diff), the valid comment should:
- Explicitly express the issues
- Clearly outline necessary actions to improve the code
- Have clear type of requested actions such as:
  * Refactoring for code quality
  * Writing tests
  * Aligning with design principles
  * Fixing bugs
  * Enhancing logging
  * Addressing specific needs"""
        
        noisy_def = """Noisy comments are review comments that do not request direct and applicable actions 
to refine the code, or the message expressed is unclear and difficult to understand.
This includes:
- Comments that do not explicitly ask for specific changes
- Comments merely justifying the submitted code change
- Comments of low quality due to vagueness
- Comments with ambiguity that hinders understanding
- Questions without actionable suggestions"""
        
        return valid_def, noisy_def
    
    @staticmethod
    def get_auxiliary_rules() -> str:
        """Get auxiliary rules developed during annotation (Section IV-A)"""
        
        return """Auxiliary Rules (based on annotation guidelines):
1. Comments asking "why" without suggesting alternatives are typically NOISY
2. Comments that only express confusion without specific feedback are NOISY
3. Comments providing specific code suggestions or improvements are VALID
4. Comments identifying bugs or potential issues with solutions are VALID
5. Comments requesting documentation or test additions are VALID
6. One-word or very short comments without context are typically NOISY
7. Comments that merely acknowledge changes without feedback are NOISY

Examples:
- "Why do we have this flag?" → NOISY (asks why without suggestion)
- "Consider using a constant for this magic number" → VALID (specific suggestion)
- "What is the purpose of this line?" → NOISY (seeks clarification only)
- "Add null check before accessing user.getName()" → VALID (specific action)"""
    
    def create_p_definition_prompt(self, with_context: bool = False) -> ChatPromptTemplate:
        """Create P_DEFINITION prompt (simpler version)"""
        
        valid_def, noisy_def = self.get_definitions()
        
        system_message = f"""You are an experienced code reviewer evaluating review comments.
Your task is to classify each comment as either 'valid' or 'noisy'.

DEFINITIONS:

{valid_def}

{noisy_def}

Evaluation Criteria:
1. Relevance: Does the comment address the code change?
2. Clarity: Is the comment clear with actionable feedback?
3. Improvement Focus: Does it aim to improve code quality?

Return your analysis in JSON format."""
        
        if with_context:
            human_message = """Analyze this code review:

CODE DIFF:
{code_diff}

REVIEW COMMENT:
{comment}

Classification:"""
        else:
            human_message = """Analyze this review comment:

COMMENT:
{comment}

Classification:"""
        
        return ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(system_message),
            HumanMessagePromptTemplate.from_template(human_message)
        ])
    
    def create_p_auxiliary_prompt(self, with_context: bool = False) -> ChatPromptTemplate:
        """Create P_AUXILIARY prompt (with additional rules)"""
        
        valid_def, noisy_def = self.get_definitions()
        auxiliary = self.get_auxiliary_rules()
        
        system_message = f"""You are an experienced code reviewer evaluating review comments.
Your task is to classify each comment as either 'valid' or 'noisy'.

DEFINITIONS:

{valid_def}

{noisy_def}

{auxiliary}

Return your analysis in JSON format."""
        
        if with_context:
            human_message = """Analyze this code review:

CODE DIFF:
{code_diff}

REVIEW COMMENT:
{comment}

Classification:"""
        else:
            human_message = """Analyze this review comment:

COMMENT:
{comment}

Classification:"""
        
        return ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(system_message),
            HumanMessagePromptTemplate.from_template(human_message)
        ])

# Initialize prompt designer
prompt_designer = PromptDesigner()

# Display the different prompts
print("=== P_DEFINITION Prompt (Simpler) ===")
print(prompt_designer.create_p_definition_prompt().messages[0].prompt.template[:500] + "...")
print("\n=== P_AUXILIARY Prompt (With Rules) ===")
print(prompt_designer.create_p_auxiliary_prompt().messages[0].prompt.template[:500] + "...")
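
One practical detail when extending these prompts: the definitions and rules are baked into the template text, and LangChain's default prompt templates use the same `{name}` placeholder convention as Python's `str.format`. Any literal brace added to a rule would therefore be read as a variable. The sketch below illustrates the issue with plain `str.format`; the helper `escape_braces` and the sample rule are hypothetical.

```python
def escape_braces(text: str) -> str:
    """Double literal braces so format-style templating leaves them alone."""
    return text.replace("{", "{{").replace("}", "}}")

# A made-up auxiliary rule that happens to contain literal braces
rule = 'Comments asking to fill an empty block such as "if (x) {}" are VALID.'

# Unescaped, the braces are interpreted as a placeholder and formatting fails
try:
    ("RULES:\n" + rule + "\n\nCOMMENT:\n{comment}").format(comment="Why?")
except (IndexError, KeyError) as exc:
    print(f"unescaped template failed: {type(exc).__name__}")

# Escaped, only {comment} is treated as a variable
template = "RULES:\n" + escape_braces(rule) + "\n\nCOMMENT:\n{comment}"
print(template.format(comment="Why do we have this flag?"))
```

Values passed in at invocation time, such as `comment` and `code_diff`, are substituted as data and do not need escaping, so code diffs containing braces are safe as inputs.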

## 5. Building the Classification Pipeline

Now let's implement the complete classification system with different LLMs.

In [None]:
class NoiseClassifier:
    """Complete implementation of the LLM-based noise classifier"""
    
    def __init__(self, model_name: str = "gpt-3.5-turbo", temperature: float = 0.1):
        """
        Initialize classifier with specified model.
        
        Args:
            model_name: LLM to use (gpt-3.5-turbo, gpt-4, etc.)
            temperature: Set to 0.1 for consistency (as in paper)
        """
        self.model_name = model_name
        self.llm = ChatOpenAI(model=model_name, temperature=temperature)
        self.parser = JsonOutputParser(pydantic_object=CommentClassification)
        self.prompt_designer = PromptDesigner()
        
    def classify(self, 
                comment: str, 
                code_diff: str = None,
                prompt_type: str = "definition") -> Dict:
        """
        Classify a single comment.
        
        Args:
            comment: The review comment text
            code_diff: Optional code diff context
            prompt_type: "definition" or "auxiliary"
            
        Returns:
            Classification result with label, explanation, and confidence
        """
        # Select prompt
        with_context = code_diff is not None
        
        if prompt_type == "definition":
            prompt = self.prompt_designer.create_p_definition_prompt(with_context)
        else:
            prompt = self.prompt_designer.create_p_auxiliary_prompt(with_context)
        
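        # Add format instructions so the model knows the expected JSON fields.
        # They are inserted as a literal SystemMessage (not a template) because
        # the instructions themselves contain JSON braces.
        from langchain_core.messages import SystemMessage
        format_instructions = self.parser.get_format_instructions()
        prompt = ChatPromptTemplate.from_messages(
            [prompt.messages[0], SystemMessage(content=format_instructions), *prompt.messages[1:]]
        )
        
        # Build chain
        chain = prompt | self.llm | self.parser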
        
        # Prepare inputs
        inputs = {"comment": comment}
        if with_context:
            inputs["code_diff"] = code_diff
        
        try:
            result = chain.invoke(inputs)
            return result
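        except Exception as e:
            # Fallback: report the failure and label the comment 'noisy' with
            # zero confidence. Failed calls therefore count as 'noisy'
            # predictions in any metrics computed downstream.
            print(f"Classification error: {e}")
            return {
                "label": "noisy",
                "explanation": "Error in processing",
                "confidence": 0.0
            }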
    
    def batch_classify(self, 
                      comments: List[Dict],
                      prompt_type: str = "definition") -> List[Dict]:
        """Classify multiple comments"""
        results = []
        
        for i, item in enumerate(comments):
            comment = item['comment']
            code_diff = item.get('code_diff', None)
            true_label = item.get('label', None)
            
            # Classify
            classification = self.classify(comment, code_diff, prompt_type)
            
            # Normalize the label and tolerate fields missing from the parsed JSON
            label = str(classification.get('label', 'noisy')).strip().lower()
            confidence = float(classification.get('confidence', 0.0))
            
            # Store result
            results.append({
                'comment': comment,
                'true_label': true_label,
                'predicted_label': label,
                'confidence': confidence,
                'explanation': classification.get('explanation', '')
            })
            
            print(f"[{i+1}/{len(comments)}] Classified: {label} "
                  f"(confidence: {confidence:.2f})")
        
        return results
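
API calls can fail transiently (rate limits, timeouts), and a single failure should not silently turn into a label. Below is a minimal retry sketch with exponential backoff using only the standard library. The wrapper `call_with_retry` and its parameters are assumptions for illustration; they are not part of LangChain or the paper.

```python
import time
from typing import Callable, Optional

def call_with_retry(fn: Callable[[], dict], attempts: int = 3,
                    base_delay: float = 1.0) -> Optional[dict]:
    """Call fn, retrying on any exception with exponential backoff.

    Returns None when every attempt fails, so the caller can exclude the
    sample instead of recording a default label.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                print(f"giving up after {attempts} attempts: {exc}")
                return None
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
    return None

# Stand-in for an LLM call that fails twice before succeeding
state = {"calls": 0}

def flaky_classify() -> dict:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return {"label": "valid", "explanation": "stub", "confidence": 0.9}

print(call_with_retry(flaky_classify, attempts=3, base_delay=0.0))
```

In this notebook's pipeline, the wrapped callable would need to raise on failure rather than return a fallback dictionary for the retry to take effect.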

## 6. Creating Test Dataset

Let's create a test dataset based on examples from the paper.

In [None]:
# Test dataset based on paper examples
test_comments = [
    # Noisy examples (from paper)
    {
        "comment": "Why do we have this flag?",
        "code_diff": "+   this.useWriterSchema = true;",
        "label": "noisy",
        "source": "Figure 1 - Top"
    },
    {
        "comment": "What is the purpose of this line?",
        "code_diff": "+ data = normalize_values(data)",
        "label": "noisy",
        "source": "Section VI-D"
    },
    {
        "comment": "Why do we need to change this?",
        "code_diff": "- DEFAULT_TIMEOUT = 30\n+ DEFAULT_TIMEOUT = 60",
        "label": "noisy",
        "source": "Inferred from patterns"
    },
    {
        "comment": "I don't understand this change",
        "code_diff": "- debug: false,\n+ debug: true,",
        "label": "noisy",
        "source": "Inferred from patterns"
    },
    
    # Valid examples (from paper)
    {
        "comment": "This can be simplified as new ArrayList<>(Arrays.asList(new ProtocolConfig(protocol)))",
        "code_diff": "- this.protocols = Arrays.asList(new ProtocolConfig[]{new ProtocolConfig(protocol)});\n" +
                     "+ this.protocols = new ArrayList<>(Arrays.asList(new ProtocolConfig[]{new ProtocolConfig(protocol)}));",
        "label": "valid",
        "source": "Figure 1 - Bottom"
    },
    {
        "comment": "Please rename $prev_ref to $previousRef",
        "code_diff": "+ $prev_ref = $product->getRef();",
        "label": "valid",
        "source": "Figure 5"
    },
    {
        "comment": "Shouldn't this be an assert instead of a throw?",
        "code_diff": "if (value < 0) {\n-   throw std::invalid_argument(\"Value must be positive\");\n+   assert(value >= 0);",
        "label": "valid",
        "source": "Section VI-A"
    },
    {
        "comment": "Add error handling for null values",
        "code_diff": "+ if (user == null) {\n+     throw new IllegalArgumentException(\"User cannot be null\");\n+ }\n  return user.getName().toUpperCase();",
        "label": "valid",
        "source": "Inferred from patterns"
    }
]

# Display dataset statistics
df_test = pd.DataFrame(test_comments)
print("Test Dataset Statistics:")
print(f"Total samples: {len(df_test)}")
print(f"Valid comments: {len(df_test[df_test['label'] == 'valid'])}")
print(f"Noisy comments: {len(df_test[df_test['label'] == 'noisy'])}")
print("\nSample distribution:")
print(df_test['label'].value_counts())

## 7. Running Classification Experiments

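Let's run the four prompt configurations from Table I of the paper on our small test set. With only eight samples, the resulting scores are illustrative and should not be expected to match the paper's numbers.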

In [None]:
def run_experiment(classifier: NoiseClassifier, 
                  test_data: List[Dict],
                  experiment_name: str,
                  prompt_type: str,
                  use_context: bool) -> Dict:
    """Run a single classification experiment"""
    
    print(f"\n=== Running Experiment: {experiment_name} ===")
    print(f"Prompt type: {prompt_type}, Use context: {use_context}")
    
    # Prepare data
    if not use_context:
        # Remove code_diff for RNL-only experiments
        test_data_processed = [
            {k: v for k, v in item.items() if k != 'code_diff'}
            for item in test_data
        ]
    else:
        test_data_processed = test_data
    
    # Run classification
    results = classifier.batch_classify(test_data_processed, prompt_type)
    
    # Calculate metrics
    true_labels = [r['true_label'] for r in results]
    pred_labels = [r['predicted_label'] for r in results]
    
    # Binary encoding for metrics
    true_binary = [1 if l == 'valid' else 0 for l in true_labels]
    pred_binary = [1 if l == 'valid' else 0 for l in pred_labels]
    
    # Calculate metrics
    metrics = {
        'experiment': experiment_name,
        'overall': {
            'precision': precision_score(true_binary, pred_binary, average='weighted'),
            'recall': recall_score(true_binary, pred_binary, average='weighted'),
            'f1': f1_score(true_binary, pred_binary, average='weighted')
        },
        'valid': {
            'precision': precision_score(true_binary, pred_binary, pos_label=1),
            'recall': recall_score(true_binary, pred_binary, pos_label=1),
            'f1': f1_score(true_binary, pred_binary, pos_label=1),
            'count': sum(pred_binary)
        },
        'noisy': {
            'precision': precision_score(true_binary, pred_binary, pos_label=0),
            'recall': recall_score(true_binary, pred_binary, pos_label=0),
            'f1': f1_score(true_binary, pred_binary, pos_label=0),
            'count': len(pred_binary) - sum(pred_binary)
        },
        'results': results
    }
    
    return metrics

# Initialize classifier
classifier = NoiseClassifier(model_name="gpt-3.5-turbo")

# Run experiments (reproducing Table I configurations)
experiments = [
    ("P_DEFINITION with RNL", "definition", False),
    ("P_DEFINITION with RNL+CDIFF", "definition", True),
    ("P_AUXILIARY with RNL", "auxiliary", False),
    ("P_AUXILIARY with RNL+CDIFF", "auxiliary", True)
]

# Run one experiment for demonstration
exp_name, prompt_type, use_context = experiments[0]
results = run_experiment(classifier, test_comments, exp_name, prompt_type, use_context)

# Display results
print(f"\n=== Results for {exp_name} ===")
print(f"Overall - Precision: {results['overall']['precision']:.3f}, "
      f"Recall: {results['overall']['recall']:.3f}, "
      f"F1: {results['overall']['f1']:.3f}")
print(f"Valid - Precision: {results['valid']['precision']:.3f}, "
      f"Recall: {results['valid']['recall']:.3f}, "
      f"Count: {results['valid']['count']}")
print(f"Noisy - Precision: {results['noisy']['precision']:.3f}, "
      f"Recall: {results['noisy']['recall']:.3f}, "
      f"Count: {results['noisy']['count']}")
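
To see what the per-class numbers mean, it helps to compute them by hand. For a chosen class, precision is TP / (TP + FP) and recall is TP / (TP + FN), with F1 as their harmonic mean. The sketch below does this in plain Python on a made-up set of predictions (not model output); `per_class_metrics` is an illustrative helper.

```python
def per_class_metrics(true_labels, pred_labels, positive):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions: one noisy comment kept, one valid comment dropped
true_labels = ["noisy"] * 4 + ["valid"] * 4
pred_labels = ["noisy", "noisy", "noisy", "valid", "valid", "valid", "valid", "noisy"]

for cls in ("valid", "noisy"):
    print(cls, per_class_metrics(true_labels, pred_labels, positive=cls))
```

For data cleaning, precision on the valid class is the number to watch: it is the share of kept comments that are actually valid.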

## 8. Analyzing Classification Results

Let's visualize and analyze the classification results in detail.

In [None]:
def visualize_results(results: Dict):
    """Create comprehensive visualizations of classification results"""
    
    # Extract data
    true_labels = [r['true_label'] for r in results['results']]
    pred_labels = [r['predicted_label'] for r in results['results']]
    confidences = [r['confidence'] for r in results['results']]
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # 1. Confusion Matrix
    cm = confusion_matrix(true_labels, pred_labels, labels=['valid', 'noisy'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['valid', 'noisy'],
                yticklabels=['valid', 'noisy'],
                ax=axes[0, 0])
    axes[0, 0].set_title('Confusion Matrix')
    axes[0, 0].set_xlabel('Predicted Label')
    axes[0, 0].set_ylabel('True Label')
    
    # 2. Confidence Distribution
    axes[0, 1].hist(confidences, bins=10, edgecolor='black', alpha=0.7)
    axes[0, 1].set_xlabel('Confidence Score')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].set_title('Confidence Score Distribution')
    axes[0, 1].axvline(np.mean(confidences), color='red', linestyle='--', 
                      label=f'Mean: {np.mean(confidences):.2f}')
    axes[0, 1].legend()
    
    # 3. Performance Metrics
    metrics_data = {
        'Valid': [results['valid']['precision'], results['valid']['recall'], results['valid']['f1']],
        'Noisy': [results['noisy']['precision'], results['noisy']['recall'], results['noisy']['f1']],
        'Overall': [results['overall']['precision'], results['overall']['recall'], results['overall']['f1']]
    }
    
    x = np.arange(3)
    width = 0.25
    
    for i, (label, values) in enumerate(metrics_data.items()):
        axes[1, 0].bar(x + i*width, values, width, label=label)
    
    axes[1, 0].set_xlabel('Metrics')
    axes[1, 0].set_ylabel('Score')
    axes[1, 0].set_title('Classification Performance by Class')
    axes[1, 0].set_xticks(x + width)
    axes[1, 0].set_xticklabels(['Precision', 'Recall', 'F1'])
    axes[1, 0].legend()
    axes[1, 0].set_ylim(0, 1.1)
    
    # 4. Confidence by Correctness
    correct_conf = [c for i, c in enumerate(confidences) if true_labels[i] == pred_labels[i]]
    incorrect_conf = [c for i, c in enumerate(confidences) if true_labels[i] != pred_labels[i]]
    
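    # Set tick labels separately; the boxplot `labels` argument is deprecated in newer Matplotlib
    axes[1, 1].boxplot([correct_conf, incorrect_conf])
    axes[1, 1].set_xticklabels(['Correct', 'Incorrect'])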
    axes[1, 1].set_ylabel('Confidence Score')
    axes[1, 1].set_title('Confidence by Prediction Correctness')
    
    plt.tight_layout()
    plt.show()

# Visualize the results
visualize_results(results)

## 9. Key Insights and Findings

Let's look at why certain prompts perform better. The cell below summarizes the insights from Section IV-D; its figures are quoted as reported, not recomputed in this notebook.

In [None]:
def analyze_prompt_effectiveness():
    """Analyze why simpler prompts outperform complex ones"""
    
    insights = {
        "Finding 1: Simpler is Better": {
            "observation": "P_DEFINITION with RNL only achieves highest precision (85.1% for GPT-3.5)",
            "explanation": "Simpler prompts reduce cognitive load on the model and avoid distraction",
            "evidence": "Table I shows consistent pattern across all models"
        },
        
        "Finding 2: Context Can Hurt": {
            "observation": "Adding code diff (CDIFF) reduces performance by 9.9-24%",
            "explanation": "Additional context can distract from the core classification task",
            "evidence": "CodeLlama precision drops from 66.0% to 63.8% with context"
        },
        
        "Finding 3: Auxiliary Rules Have Mixed Effects": {
            "observation": "P_AUXILIARY improves some models but hurts others",
            "explanation": "Model-specific: helps CodeLlama but reduces GPT-3.5 performance",
            "evidence": "CodeLlama F1 improves from 58.0% to 70.1% with auxiliary rules"
        },
        
        "Finding 4: Model Temperature Matters": {
            "observation": "Low temperature (0.1) ensures consistency",
            "explanation": "Classification tasks benefit from deterministic behavior",
            "evidence": "Paper uses 0.1 temperature for all experiments"
        }
    }
    
    # Display insights
    for title, details in insights.items():
        print(f"\n{title}")
        print("=" * len(title))
        for key, value in details.items():
            print(f"{key.capitalize()}: {value}")

analyze_prompt_effectiveness()
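
When reading percentages like those above, it matters whether a drop is stated in absolute percentage points or as a relative change; the summary does not say which. The short sketch below computes both for the CodeLlama precision figures quoted in Finding 2. The helper name `performance_drop` is illustrative.

```python
def performance_drop(before: float, after: float) -> tuple:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative

# CodeLlama precision without and with code diff, as quoted in Finding 2
points, percent = performance_drop(66.0, 63.8)
print(f"Absolute drop: {points:.1f} points")
print(f"Relative drop: {percent:.1f}%")
```

Here the drop is 2.2 points, or about 3.3% in relative terms, so this particular example sits below the 9.9-24% range quoted in the same finding; that range presumably refers to other models or metrics.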

## 10. Error Analysis

Let's examine common classification errors to understand model limitations.

In [None]:
def perform_error_analysis(results: Dict):
    """Analyze misclassified comments to understand patterns"""
    
    errors = []
    for r in results['results']:
        if r['true_label'] != r['predicted_label']:
            errors.append(r)
    
    print(f"Total errors: {len(errors)} out of {len(results['results'])} "
          f"({len(errors)/len(results['results'])*100:.1f}%)")
    
    if errors:
        print("\n=== Error Analysis ===")
        for i, error in enumerate(errors, 1):
            print(f"\nError {i}:")
            print(f"Comment: '{error['comment']}'")
            print(f"True: {error['true_label']}, Predicted: {error['predicted_label']}")
            print(f"Confidence: {error['confidence']:.2f}")
            print(f"Explanation: {error['explanation'][:100]}...")
    
    # Common error patterns from the paper (Section VII)
    print("\n=== Common Error Patterns (from paper) ===")
    patterns = [
        "1. Domain-specific terms: LLMs may incorrectly classify comments with technical terms",
        "2. Implicit suggestions: Comments that imply actions without explicit statements",
        "3. Context-dependent: Some comments need project context to classify correctly",
        "4. Edge cases: Very short comments or questions can be ambiguous"
    ]
    
    for pattern in patterns:
        print(pattern)

# Perform error analysis
perform_error_analysis(results)
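
Not all errors cost the same in a cleaning pipeline: a noisy comment predicted as valid leaks noise into the cleaned dataset, while a valid comment predicted as noisy only discards useful data. A minimal sketch for counting errors by direction follows; `error_breakdown` is a hypothetical helper and the sample results are made up, shaped like the entries in `results['results']`.

```python
from collections import Counter

def error_breakdown(results):
    """Count misclassifications by direction: (true_label, predicted_label)."""
    return Counter(
        (r["true_label"], r["predicted_label"])
        for r in results
        if r["true_label"] != r["predicted_label"]
    )

# Made-up results for illustration only
sample_results = [
    {"true_label": "noisy", "predicted_label": "valid"},
    {"true_label": "noisy", "predicted_label": "noisy"},
    {"true_label": "valid", "predicted_label": "noisy"},
    {"true_label": "noisy", "predicted_label": "valid"},
    {"true_label": "valid", "predicted_label": "valid"},
]

for (true, pred), count in error_breakdown(sample_results).items():
    print(f"true={true}, predicted={pred}: {count}")
```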

## 11. Practical Implementation Guide

Based on the paper's findings, here's a practical guide for implementing noise classification.

In [None]:
class ProductionClassifier:
    """Production-ready implementation based on paper's best practices"""
    
    def __init__(self):
        # Use best performing configuration from Table I
        self.model = "gpt-3.5-turbo"  # 85.1% precision on valid
        self.temperature = 0.1  # Low temperature for consistency
        self.prompt_type = "definition"  # Simpler prompt performs better
        self.use_context = False  # RNL only (no code diff)
        
        self.classifier = NoiseClassifier(
            model_name=self.model,
            temperature=self.temperature
        )
    
    def classify_for_cleaning(self, comment: str) -> Tuple[str, float]:
        """
        Classify comment for dataset cleaning.
        
        Returns:
            (label, confidence) tuple
        """
        result = self.classifier.classify(
            comment=comment,
            code_diff=None,  # Best performance without context
            prompt_type=self.prompt_type
        )
        
        return result['label'], result['confidence']
    
    def should_keep_comment(self, comment: str, threshold: float = 0.7) -> bool:
        """
        Determine if comment should be kept in cleaned dataset.
        
        Args:
            comment: Review comment text
            threshold: Confidence threshold for keeping comments
            
        Returns:
            True if comment is valid with sufficient confidence
        """
        label, confidence = self.classify_for_cleaning(comment)
        
        # Keep only valid comments with high confidence
        return label == "valid" and confidence >= threshold
    
    def clean_dataset(self, comments: List[str], 
                     threshold: float = 0.7) -> Dict:
        """
        Clean a dataset of comments.
        
        Returns:
            Dictionary with cleaned comments and statistics
        """
        kept = []
        removed = []
        
        for comment in comments:
            if self.should_keep_comment(comment, threshold):
                kept.append(comment)
            else:
                removed.append(comment)
        
        return {
            'kept': kept,
            'removed': removed,
            'original_size': len(comments),
            'cleaned_size': len(kept),
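            # Guard against an empty input list
            'reduction_rate': len(removed) / len(comments) if comments else 0.0,
            # Static note, not a value measured on this dataset
            'improvement': 'Expected (not measured here): 64% → 85% valid ratio'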
        }

# Demonstration
prod_classifier = ProductionClassifier()

# Test on sample comments
test_samples = [
    "Why do we have this flag?",  # Noisy
    "Consider using a constant for this magic number",  # Valid
    "What does this do?",  # Noisy
    "Add null check before accessing the property"  # Valid
]

print("=== Production Classifier Test ===")
for comment in test_samples:
    label, conf = prod_classifier.classify_for_cleaning(comment)
    # Derive the keep decision from this result; calling should_keep_comment()
    # here would query the LLM a second time for the same comment
    keep = label == "valid" and conf >= 0.7
    print(f"\nComment: '{comment}'")
    print(f"Classification: {label} (confidence: {conf:.2f})")
    print(f"Keep in dataset: {keep}")
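
The 0.7 threshold is a tunable choice, and self-reported LLM confidence may not be well calibrated, so it is worth checking how the threshold trades dataset size against purity on a labeled sample. The sketch below sweeps a few thresholds; `sweep_thresholds`, the threshold values, and the sample data are all illustrative assumptions.

```python
def sweep_thresholds(results, thresholds=(0.5, 0.7, 0.9)):
    """For each threshold, count kept comments and the share that are truly valid."""
    rows = []
    for t in thresholds:
        kept = [r for r in results
                if r["predicted_label"] == "valid" and r["confidence"] >= t]
        truly_valid = sum(r["true_label"] == "valid" for r in kept)
        precision = truly_valid / len(kept) if kept else None
        rows.append({"threshold": t, "kept": len(kept), "precision": precision})
    return rows

# Made-up labeled results for illustration only
labeled = [
    {"true_label": "valid", "predicted_label": "valid", "confidence": 0.95},
    {"true_label": "valid", "predicted_label": "valid", "confidence": 0.75},
    {"true_label": "noisy", "predicted_label": "valid", "confidence": 0.60},
    {"true_label": "noisy", "predicted_label": "noisy", "confidence": 0.90},
]

for row in sweep_thresholds(labeled):
    print(row)
```

A higher threshold keeps fewer comments but can raise the share of kept comments that are valid. Whether that trade is worthwhile depends on how much training data can be spared.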

## 12. Cost-Benefit Analysis

Let's calculate the cost-benefit of LLM-based cleaning (from Section VII).

In [None]:
def calculate_cost_benefit(dataset_size: int = 128058):  # Paper's dataset size
    """Calculate cost-benefit analysis for LLM vs manual cleaning"""
    
    # Costs from paper
    llm_cost = 50  # USD for GPT-3.5
    llm_time = 39  # hours
    
    # Manual annotation costs
    manual_rate = 8  # USD per hour
    time_per_comment = 1/60  # 1 minute per comment in hours
    manual_time = dataset_size * time_per_comment
    manual_cost = manual_time * manual_rate
    
    # Benefits
    bleu_improvement = 0.13  # 13% improvement
    quality_improvement = 0.24  # 24% information score improvement
    training_reduction = 0.66  # 66% less data needed
    
    print("=== Cost-Benefit Analysis ===")
    print(f"\nDataset size: {dataset_size:,} comments")
    
    print("\n--- Costs ---")
    print(f"LLM Approach:")
    print(f"  Cost: ${llm_cost}")
    print(f"  Time: {llm_time} hours")
    
    print(f"\nManual Approach:")
    print(f"  Cost: ${manual_cost:,.0f}")
    print(f"  Time: {manual_time:,.0f} hours")
    
    print(f"\nCost Reduction: {(1 - llm_cost/manual_cost)*100:.1f}%")
    print(f"Time Reduction: {(1 - llm_time/manual_time)*100:.1f}%")
    
    print("\n--- Benefits ---")
    print(f"BLEU-4 Score Improvement: +{bleu_improvement*100:.0f}%")
    print(f"Information Quality: +{quality_improvement*100:.0f}%")
    print(f"Training Data Reduction: {training_reduction*100:.0f}% (with better results!)")
    
    print("\n--- ROI ---")
    roi = (manual_cost - llm_cost) / llm_cost
    print(f"Return on Investment: {roi:.0f}x")
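    # Break-even: dataset size at which manual cost equals the (fixed) LLM cost
    manual_cost_per_comment = manual_cost / dataset_size
    print(f"Break-even dataset size: {round(llm_cost / manual_cost_per_comment):,} comments")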

calculate_cost_benefit()
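
The headline cost ratio depends heavily on the assumed manual annotation rate. The sketch below recomputes the manual-to-LLM cost ratio for a few hourly rates, keeping the dataset size (128,058), the one-minute-per-comment assumption, and the $50 LLM cost from the cell above. The $12 and $15 rates are hypothetical values added for comparison.

```python
def manual_to_llm_cost_ratio(dataset_size: int, hourly_rate: float,
                             minutes_per_comment: float = 1.0,
                             llm_cost: float = 50.0) -> float:
    """Manual annotation cost divided by the LLM cost."""
    manual_hours = dataset_size * minutes_per_comment / 60
    return manual_hours * hourly_rate / llm_cost

for rate in (8, 12, 15):
    ratio = manual_to_llm_cost_ratio(128_058, rate)
    print(f"${rate}/hour: manual annotation costs about {ratio:.0f}x the LLM run")
```

At $8 per hour the ratio is about 341x, and it scales linearly with the hourly rate, so any single "N times cheaper" figure should be read together with the rate it assumes.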

## 13. Summary and Key Takeaways

### Main Findings:

1. **Simpler Prompts Win**: P_DEFINITION with only comment text achieves best results
2. **Context Can Hurt**: Adding code diff reduces performance by up to 24%
3. **High Precision Achievable**: 85% precision in identifying valid comments
4. **Cost-Effective**: Roughly 340x cheaper than manual annotation under the Section 12 assumptions ($8/hour, one minute per comment)

### Best Practices:

1. Use simple, focused prompts
2. Keep temperature low (0.1) for consistency
3. Focus on comment text only (skip code context)
4. Use confidence thresholds for production

### Future Directions:

1. Ensemble methods combining multiple LLMs
2. Fine-tuning smaller models on classified data
3. Active learning for edge cases
4. Domain-specific adaptations
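
As a starting point for the first direction, here is a minimal sketch of majority voting over labels from several classifiers. This is not something the paper evaluates; the function `majority_vote` and the choice to break ties toward "noisy" (the conservative option when cleaning) are assumptions.

```python
from collections import Counter

def majority_vote(labels, tie_break="noisy"):
    """Return the most common label, falling back to tie_break on a tie."""
    if not labels:
        raise ValueError("need at least one label")
    ranked = Counter(labels).most_common()
    # A tie exists when the top two labels have the same count
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]

# Hypothetical labels from three different models for one comment
print(majority_vote(["valid", "valid", "noisy"]))
# Two models disagreeing: the tie falls back to the conservative label
print(majority_vote(["valid", "noisy"]))
```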

This focused learning notebook has demonstrated how LLMs can effectively classify code review comments, enabling semantic data cleaning that significantly improves model performance while reducing costs.