# Focused Learning: Semantic Data Cleaning Pipeline

## Learning Objectives

This notebook provides a deep dive into the semantic data cleaning pipeline that transforms noisy code review datasets into high-quality training data. We'll explore how removing noisy comments paradoxically improves model performance despite reducing dataset size.

**What you'll learn:**
1. How to build an end-to-end data cleaning pipeline using LLMs
2. The impact of data quality vs. data quantity on model performance
3. How to create controlled experiments for fair comparison
4. Practical strategies for dataset curation at scale

**Paper Reference**: Section V - Impact on Comment Generation Accuracy (RQ2)

## 1. Theoretical Foundation

### 1.1 The Data Quality Paradox

The paper reveals a counterintuitive finding: **smaller, cleaner datasets outperform larger, noisy ones**.

Key statistics from the paper:
- **Original dataset**: 117,739 training samples (64% valid)
- **Cleaned with GPT-3.5**: 39,625 samples (85% valid) - 66% reduction
- **Cleaned with Llama3**: 87,872 samples (75% valid) - 25% reduction
- **Result**: BLEU-4 improves by 7.5% (Llama3-cleaned) to 13% (GPT-3.5-cleaned) relative to the original dataset
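
The reduction percentages follow directly from the sample counts. A minimal sketch of that arithmetic, using only the figures quoted in the list above (the helper name `reduction_pct` is illustrative, not from the paper):

```python
# Training-set sizes quoted in the statistics above.
ORIGINAL_TRAIN = 117_739
CLEANED_TRAIN = {"GPT-3.5": 39_625, "Llama3": 87_872}


def reduction_pct(original: int, cleaned: int) -> float:
    """Percentage of samples removed by cleaning."""
    return (original - cleaned) / original * 100


for model, cleaned in CLEANED_TRAIN.items():
    pct = reduction_pct(ORIGINAL_TRAIN, cleaned)
    print(f"{model}: kept {cleaned:,} of {ORIGINAL_TRAIN:,} samples ({pct:.1f}% removed)")
```

The unrounded values are about 66.3% and 25.4%, which match the `reduction` entries used later in the model comparison.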

### 1.2 Why Quality Matters More Than Quantity

1. **Noise propagation**: Models learn and amplify patterns from noisy data
2. **Signal clarity**: Clean data provides clearer learning signals
3. **Efficiency**: A smaller dataset trains faster, and in the paper's experiments the cleaned subsets still produced better results

### 1.3 The Cleaning Process

The semantic cleaning pipeline:
1. **Classification**: Use LLMs to classify each comment as valid/noisy
2. **Filtering**: Retain only comments predicted as valid
3. **Validation**: Compare against controlled datasets
4. **Evaluation**: Measure impact on downstream tasks
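
The first two steps can be sketched without any LLM at all. The block below is an illustration under stated assumptions: `keyword_classifier` is a hypothetical rule-based stand-in for the LLM call, and `clean` is an illustrative helper, not the notebook's `SemanticDataCleaner`. Steps 3 and 4 need a trained model and are not covered here.

```python
from typing import Callable, List, Tuple


def keyword_classifier(comment: str) -> str:
    """Hypothetical stand-in for the LLM: 'valid' if the comment names an action."""
    action_words = ("add", "use", "consider", "rename", "extract", "remove")
    tokens = comment.lower().split()
    return "valid" if any(word in tokens for word in action_words) else "noisy"


def clean(comments: List[str], classify: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Step 1 (classify) and step 2 (filter): split comments into kept and removed."""
    kept, removed = [], []
    for text in comments:
        (kept if classify(text) == "valid" else removed).append(text)
    return kept, removed


comments = [
    "Add error handling for the case when userId is null",
    "Why this change?",
    "Use a constant instead of this magic number",
    "hmm",
]
kept, removed = clean(comments, keyword_classifier)
print(f"kept {len(kept)} of {len(comments)} comments")
```

Passing the classifier in as a function keeps the filtering logic independent of the model, so the same `clean` call would work with an LLM-backed classifier. A keyword rule like this one misfires easily (a question containing "use" would be kept), which is the gap an LLM classifier is meant to close.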

## 2. Environment Setup

In [None]:
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
import random
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# LangChain for LLM integration
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

# For parallel processing
from concurrent.futures import ThreadPoolExecutor, as_completed
import asyncio

# Set style and seed
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)
random.seed(42)

print("Environment setup complete!")

## 3. Data Structures for Cleaning Pipeline

In [None]:
@dataclass
class ReviewComment:
    """Enhanced data structure for code review comments"""
    id: str
    comment_text: str
    code_diff: str
    project: str
    timestamp: datetime
    
    # Original labels (if available)
    original_label: Optional[str] = None
    
    # Cleaning metadata
    predicted_label: Optional[str] = None
    confidence: Optional[float] = None
    cleaning_model: Optional[str] = None
    cleaning_timestamp: Optional[datetime] = None
    explanation: Optional[str] = None

@dataclass
class CleaningStats:
    """Statistics for the cleaning process"""
    original_size: int
    cleaned_size: int
    removed_size: int
    
    original_valid_ratio: float
    cleaned_valid_ratio: float
    
    reduction_percentage: float
    improvement_percentage: float
    
    processing_time: float
    cost_estimate: float
    
    # Distribution of confidence scores
    confidence_distribution: Dict[str, List[float]] = field(default_factory=dict)

@dataclass 
class DatasetPartition:
    """Represents a dataset partition (train/val/test)"""
    name: str
    comments: List[ReviewComment]
    stats: Optional[CleaningStats] = None

print("Data structures defined!")

## 4. Creating Mock CodeReviewer Dataset

The mock data below follows the statistics in Table II of the paper.

In [None]:
def generate_mock_dataset(size: int = 1000, valid_ratio: float = 0.64) -> List[ReviewComment]:
    """Generate mock dataset matching paper's statistics"""
    
    # Valid comment templates
    valid_templates = [
        "Consider extracting this logic into a separate method for better reusability",
        "This variable name should follow camelCase convention: {}",
        "Add error handling for the case when {} is null",
        "This method is too long. Consider breaking it down into smaller functions",
        "Use a constant instead of this magic number {}",
        "Add unit tests to cover this edge case",
        "This could be simplified using a ternary operator",
        "Consider using dependency injection here for better testability",
        "Add documentation explaining the purpose of this algorithm",
        "This recursive approach might cause stack overflow for large inputs"
    ]
    
    # Noisy comment templates  
    noisy_templates = [
        "Why this change?",
        "What does this do?",
        "I don't understand this",
        "Is this necessary?",
        "???",
        "Not sure about this",
        "hmm",
        "This looks weird",
        "Why not use the old approach?",
        "What's the purpose of this line?"
    ]
    
    # Code diff templates
    code_diffs = [
        "+ this.config = loadConfig();",
        "- return data.process()\n+ return data.validate().process()",
        "+ if (user == null) throw new Error('User required');",
        "- for (int i = 0; i < 100; i++)\n+ for (int i = 0; i < MAX_ITERATIONS; i++)",
        "+ logger.debug('Processing item: ' + item.id);"
    ]
    
    # Projects
    projects = ["apache/commons", "spring/framework", "tensorflow/models", 
               "facebook/react", "microsoft/vscode"]
    
    dataset = []
    n_valid = int(size * valid_ratio)
    
    # Generate valid comments
    for i in range(n_valid):
        template = random.choice(valid_templates)
        comment_text = template.format(
            random.choice(["userId", "configValue", "MAX_SIZE", "responseData"])
        ) if "{}" in template else template
        
        dataset.append(ReviewComment(
            id=f"comment_{i}",
            comment_text=comment_text,
            code_diff=random.choice(code_diffs),
            project=random.choice(projects),
            timestamp=datetime.now(),
            original_label="valid"
        ))
    
    # Generate noisy comments
    for i in range(n_valid, size):
        dataset.append(ReviewComment(
            id=f"comment_{i}",
            comment_text=random.choice(noisy_templates),
            code_diff=random.choice(code_diffs),
            project=random.choice(projects),
            timestamp=datetime.now(),
            original_label="noisy"
        ))
    
    # Shuffle dataset
    random.shuffle(dataset)
    return dataset

# Generate datasets matching paper sizes (scaled down for demo)
train_data = generate_mock_dataset(1000, 0.64)  # Scaled from 117,739
val_data = generate_mock_dataset(100, 0.64)     # Scaled from 10,319
test_data = generate_mock_dataset(100, 0.64)    # Scaled from 10,169

print(f"Generated mock datasets:")
print(f"  Training: {len(train_data)} samples")
print(f"  Validation: {len(val_data)} samples")
print(f"  Test: {len(test_data)} samples")
print(f"\nValid ratio: {sum(1 for c in train_data if c.original_label == 'valid') / len(train_data):.1%}")

## 5. Implementing the Semantic Data Cleaner

This is the core component that implements the cleaning approach from Section V-A.

In [None]:
class SemanticDataCleaner:
    """Production-grade semantic data cleaner using LLMs"""
    
    def __init__(self, model_name: str = "gpt-3.5-turbo", batch_size: int = 10):
        self.model_name = model_name
        self.batch_size = batch_size
        self.llm = ChatOpenAI(model=model_name, temperature=0.1)
        
        # Best performing prompt from RQ1
        self.prompt = self._create_classification_prompt()
        self.parser = JsonOutputParser(pydantic_object=self._get_output_schema())
        
    def _get_output_schema(self):
        class Classification(BaseModel):
            label: str = Field(description="'valid' or 'noisy'")
            confidence: float = Field(description="0.0 to 1.0")
            explanation: str = Field(description="Brief reasoning")
        return Classification
    
    def _create_classification_prompt(self) -> ChatPromptTemplate:
        """Create the best performing prompt (P_DEFINITION with RNL)"""
        
        template = """You are an experienced code reviewer. Classify this review comment as 'valid' or 'noisy'.

Valid comments: Provide clear suggestions for code improvement with specific actions.
Noisy comments: Vague, unclear, or don't request specific changes.

Comment: {comment}

Return JSON with label, confidence, and brief explanation."""
        
        return ChatPromptTemplate.from_template(template)
    
    def classify_comment(self, comment: ReviewComment) -> Dict:
        """Classify a single comment"""
        chain = self.prompt | self.llm | self.parser
        
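        try:
            result = chain.invoke({"comment": comment.comment_text})
            # Normalize the parsed JSON so callers can rely on all three keys
            return {
                "label": str(result.get("label", "noisy")).strip().lower(),
                "confidence": float(result.get("confidence", 0.0)),
                "explanation": str(result.get("explanation", "")),
            }
        except Exception as e:
            # Default to noisy on any API, parsing, or type error
            return {
                "label": "noisy",
                "confidence": 0.0,
                "explanation": f"Classification error: {str(e)}"
            }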
    
    def clean_dataset(self, 
                     dataset: List[ReviewComment],
                     progress_callback=None) -> Tuple[DatasetPartition, DatasetPartition, CleaningStats]:
        """
        Clean a dataset by removing noisy comments.
        
        Returns:
            - Cleaned dataset partition
            - Removed dataset partition  
            - Cleaning statistics
        """
        start_time = datetime.now()
        
        cleaned_comments = []
        removed_comments = []
        confidence_scores = {"valid": [], "noisy": []}
        
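        # Walk the dataset in batches. Comments are still classified one at a time,
        # so batching only sets the granularity of progress reporting.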
        total_batches = len(dataset) // self.batch_size + (1 if len(dataset) % self.batch_size else 0)
        
        for batch_idx in tqdm(range(total_batches), desc="Cleaning batches"):
            batch_start = batch_idx * self.batch_size
            batch_end = min((batch_idx + 1) * self.batch_size, len(dataset))
            batch = dataset[batch_start:batch_end]
            
            # Process batch
            for comment in batch:
                classification = self.classify_comment(comment)
                
                # Update comment metadata
                comment.predicted_label = classification["label"]
                comment.confidence = classification["confidence"]
                comment.explanation = classification["explanation"]
                comment.cleaning_model = self.model_name
                comment.cleaning_timestamp = datetime.now()
                
                # Sort into cleaned or removed
                if classification["label"] == "valid":
                    cleaned_comments.append(comment)
                    confidence_scores["valid"].append(classification["confidence"])
                else:
                    removed_comments.append(comment)
                    confidence_scores["noisy"].append(classification["confidence"])
            
            if progress_callback:
                progress_callback(batch_idx + 1, total_batches)
        
        # Calculate statistics
        processing_time = (datetime.now() - start_time).total_seconds()
        
        # Calculate valid ratios
        original_valid = sum(1 for c in dataset if c.original_label == "valid") / len(dataset)
        cleaned_valid = sum(1 for c in cleaned_comments if c.original_label == "valid") / len(cleaned_comments) if cleaned_comments else 0
        
        stats = CleaningStats(
            original_size=len(dataset),
            cleaned_size=len(cleaned_comments),
            removed_size=len(removed_comments),
            original_valid_ratio=original_valid,
            cleaned_valid_ratio=cleaned_valid,
            reduction_percentage=(len(removed_comments) / len(dataset)) * 100,
            improvement_percentage=((cleaned_valid - original_valid) / original_valid) * 100 if original_valid > 0 else 0,
            processing_time=processing_time,
            cost_estimate=self._estimate_cost(len(dataset)),
            confidence_distribution=confidence_scores
        )
        
        # Create partitions
        cleaned_partition = DatasetPartition(
            name=f"cleaned_{self.model_name}",
            comments=cleaned_comments,
            stats=stats
        )
        
        removed_partition = DatasetPartition(
            name=f"removed_{self.model_name}",
            comments=removed_comments
        )
        
        return cleaned_partition, removed_partition, stats
    
    def _estimate_cost(self, n_comments: int) -> float:
        """Estimate cleaning cost based on paper's figures"""
        # Paper: $50 for 128,058 comments with GPT-3.5
        cost_per_comment = 50 / 128058
        return n_comments * cost_per_comment

# Initialize cleaner
cleaner = SemanticDataCleaner(model_name="gpt-3.5-turbo")
print("Semantic data cleaner initialized!")
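
The setup cell imports `ThreadPoolExecutor`, but `clean_dataset` calls `classify_comment` sequentially. Because each LLM request is network-bound, a thread pool is one way to overlap them. The sketch below shows the pattern only: `classify_stub` is a hypothetical rule-based placeholder for the real API call, and `max_workers=4` is an arbitrary choice that would need tuning against the provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def classify_stub(text: str) -> Dict[str, object]:
    """Hypothetical placeholder for a network-bound LLM classification call."""
    is_noisy = text.strip().endswith("?") or len(text.split()) < 3
    return {
        "label": "noisy" if is_noisy else "valid",
        "confidence": 1.0,
        "explanation": "rule-based stub",
    }


def classify_concurrently(texts: List[str], max_workers: int = 4) -> List[Dict[str, object]]:
    """Classify texts in parallel threads; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_stub, texts))


results = classify_concurrently([
    "Add unit tests to cover this edge case",
    "What does this do?",
    "???",
])
print([r["label"] for r in results])
```

Since `pool.map` preserves input order, each result can be zipped back onto its comment without extra bookkeeping.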

## 6. Running the Cleaning Process

Let's clean our mock dataset and analyze the results.

In [None]:
# Clean the training dataset
print("Cleaning training dataset...")
cleaned_train, removed_train, train_stats = cleaner.clean_dataset(train_data[:100])  # Use subset for demo

print("\n=== Cleaning Statistics ===")
print(f"Original size: {train_stats.original_size}")
print(f"Cleaned size: {train_stats.cleaned_size} ({train_stats.cleaned_size/train_stats.original_size:.1%})")
print(f"Removed: {train_stats.removed_size} ({train_stats.reduction_percentage:.1f}%)")
print(f"\nValid ratio:")
print(f"  Original: {train_stats.original_valid_ratio:.1%}")
print(f"  Cleaned: {train_stats.cleaned_valid_ratio:.1%}")
print(f"  Relative improvement: {train_stats.improvement_percentage:+.1f}%")
print(f"\nProcessing time: {train_stats.processing_time:.1f}s")
print(f"Estimated cost: ${train_stats.cost_estimate:.2f}")
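
Note that `improvement_percentage` is a relative change in the valid ratio, not a difference in percentage points. A short worked example using the valid ratios quoted in Section 1.1 (64% original, 85% after GPT-3.5 cleaning, 75% after Llama3 cleaning) shows how far apart the two readings are; the function names are illustrative.

```python
def relative_improvement(original_ratio: float, cleaned_ratio: float) -> float:
    """Relative change in valid ratio, as computed for CleaningStats.improvement_percentage."""
    if original_ratio <= 0:
        return 0.0
    return (cleaned_ratio - original_ratio) / original_ratio * 100


def absolute_improvement(original_ratio: float, cleaned_ratio: float) -> float:
    """Change in valid ratio expressed in percentage points."""
    return (cleaned_ratio - original_ratio) * 100


for model, cleaned_ratio in {"GPT-3.5": 0.85, "Llama3": 0.75}.items():
    rel = relative_improvement(0.64, cleaned_ratio)
    pts = absolute_improvement(0.64, cleaned_ratio)
    print(f"{model}: {pts:+.1f} points absolute, {rel:+.1f}% relative")
```

For GPT-3.5 this gives roughly +21 points absolute but about +32.8% relative, so it is worth stating which one a reported figure refers to.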

## 7. Visualizing Cleaning Results

Let's create comprehensive visualizations to understand the cleaning impact.

In [None]:
def visualize_cleaning_results(stats: CleaningStats, cleaned: DatasetPartition, removed: DatasetPartition):
    """Create visualizations for cleaning results"""
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    # 1. Dataset size comparison
    sizes = [stats.original_size, stats.cleaned_size, stats.removed_size]
    labels = ['Original', 'Cleaned', 'Removed']
    colors = ['#3498db', '#2ecc71', '#e74c3c']
    
    axes[0, 0].bar(labels, sizes, color=colors)
    axes[0, 0].set_ylabel('Number of Comments')
    axes[0, 0].set_title('Dataset Size Comparison')
    
    # Add percentage labels
    for i, (label, size) in enumerate(zip(labels[1:], sizes[1:])):
        pct = size / stats.original_size * 100
        axes[0, 0].text(i+1, size + stats.original_size*0.01, f'{pct:.1f}%', ha='center')
    
    # 2. Valid ratio improvement
    ratios = [stats.original_valid_ratio, stats.cleaned_valid_ratio]
    labels = ['Original', 'Cleaned']
    
    bars = axes[0, 1].bar(labels, ratios, color=['#95a5a6', '#27ae60'])
    axes[0, 1].set_ylabel('Valid Comment Ratio')
    axes[0, 1].set_title('Data Quality Improvement')
    axes[0, 1].set_ylim(0, 1)
    
    # Add value labels
    for bar, ratio in zip(bars, ratios):
        axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                       f'{ratio:.1%}', ha='center')
    
    # 3. Confidence distribution
    if stats.confidence_distribution['valid']:
        axes[0, 2].hist(stats.confidence_distribution['valid'], bins=20, alpha=0.7, 
                       label='Valid', color='#2ecc71')
    if stats.confidence_distribution['noisy']:
        axes[0, 2].hist(stats.confidence_distribution['noisy'], bins=20, alpha=0.7,
                       label='Noisy', color='#e74c3c')
    axes[0, 2].set_xlabel('Confidence Score')
    axes[0, 2].set_ylabel('Count')
    axes[0, 2].set_title('Classification Confidence Distribution')
    axes[0, 2].legend()
    
    # 4. Classification accuracy (if original labels available)
    tp = sum(1 for c in cleaned.comments if c.original_label == 'valid')
    fp = sum(1 for c in cleaned.comments if c.original_label == 'noisy')
    tn = sum(1 for c in removed.comments if c.original_label == 'noisy')
    fn = sum(1 for c in removed.comments if c.original_label == 'valid')
    
    cm = [[tp, fn], [fp, tn]]
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Pred Valid', 'Pred Noisy'],
                yticklabels=['True Valid', 'True Noisy'],
                ax=axes[1, 0])
    axes[1, 0].set_title('Classification Performance')
    
    # 5. Comment length distribution
    cleaned_lengths = [len(c.comment_text.split()) for c in cleaned.comments]
    removed_lengths = [len(c.comment_text.split()) for c in removed.comments]
    
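    axes[1, 1].boxplot([cleaned_lengths, removed_lengths])
    # Set tick labels separately; the boxplot `labels=` keyword is deprecated in newer Matplotlib
    axes[1, 1].set_xticklabels(['Cleaned', 'Removed'])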
    axes[1, 1].set_ylabel('Comment Length (words)')
    axes[1, 1].set_title('Comment Length Distribution')
    
    # 6. Cost-benefit analysis
    paper_stats = {
        'GPT-3.5': {'reduction': 66, 'improvement': 13.0, 'cost': 50},
        'Llama3': {'reduction': 25, 'improvement': 7.5, 'cost': 0},
        'Manual': {'reduction': 0, 'improvement': 0, 'cost': 25600}
    }
    
    methods = list(paper_stats.keys())
    costs = [paper_stats[m]['cost'] for m in methods]
    improvements = [paper_stats[m]['improvement'] for m in methods]
    
    ax2 = axes[1, 2].twinx()
    
    bars1 = axes[1, 2].bar(np.arange(len(methods)) - 0.2, costs, 0.4, 
                          label='Cost ($)', color='#e74c3c')
    bars2 = ax2.bar(np.arange(len(methods)) + 0.2, improvements, 0.4,
                    label='BLEU Improvement (%)', color='#2ecc71')
    
    axes[1, 2].set_xlabel('Method')
    axes[1, 2].set_ylabel('Cost ($)', color='#e74c3c')
    ax2.set_ylabel('BLEU Improvement (%)', color='#2ecc71')
    axes[1, 2].set_xticks(np.arange(len(methods)))
    axes[1, 2].set_xticklabels(methods)
    axes[1, 2].set_title('Cost vs. Performance (from paper)')
    axes[1, 2].set_yscale('symlog')  # Symmetric log handles the $0 Llama3 cost while compressing the large gap
    
    plt.tight_layout()
    plt.show()

# Visualize results
visualize_cleaning_results(train_stats, cleaned_train, removed_train)
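
The confusion matrix in panel 4 can be condensed into the metrics used later in the model comparison (precision on valid comments, recall on noisy ones). A minimal sketch follows; the counts passed in are made-up illustration values, not results from the paper or from a real run.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive metrics from confusion counts, treating 'valid' as the kept class.

    tp: valid comments kept      fp: noisy comments kept
    tn: noisy comments removed   fn: valid comments removed
    """
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "precision_valid": ratio(tp, tp + fp),  # equals the cleaned set's valid ratio
        "recall_valid": ratio(tp, tp + fn),     # share of valid comments that survived
        "recall_noisy": ratio(tn, tn + fp),     # share of noisy comments that were removed
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (100 comments, 64 of them truly valid).
metrics = confusion_metrics(tp=56, fp=6, tn=30, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Precision on the valid class is the same quantity as `cleaned_valid_ratio`, while recall on the valid class shows how much useful data the cleaning step discards.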

## 8. Creating Controlled Datasets

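For a fair comparison we need controlled datasets: samples of the original data with the same size as the cleaned set, so that any gain can be attributed to data quality rather than to the smaller size.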

In [None]:
class ControlledDatasetCreator:
    """Create controlled datasets for fair comparison"""
    
    @staticmethod
    def create_controlled_dataset(original: List[ReviewComment], 
                                target_size: int,
                                strategy: str = "random") -> DatasetPartition:
        """
        Create controlled dataset by sampling from original.
        
        Strategies:
        - random: Random sampling
        - stratified: Maintain original valid/noisy ratio
        - recent: Select most recent comments
        """
        
        if strategy == "random":
            sampled = random.sample(original, min(target_size, len(original)))
            
        elif strategy == "stratified":
            # Maintain original ratio
            valid_comments = [c for c in original if c.original_label == "valid"]
            noisy_comments = [c for c in original if c.original_label == "noisy"]
            
            valid_ratio = len(valid_comments) / len(original)
            n_valid = int(target_size * valid_ratio)
            n_noisy = target_size - n_valid
            
            sampled_valid = random.sample(valid_comments, min(n_valid, len(valid_comments)))
            sampled_noisy = random.sample(noisy_comments, min(n_noisy, len(noisy_comments)))
            
            sampled = sampled_valid + sampled_noisy
            random.shuffle(sampled)
            
        elif strategy == "recent":
            # Sort by timestamp and take most recent
            sorted_comments = sorted(original, key=lambda c: c.timestamp, reverse=True)
            sampled = sorted_comments[:target_size]
            
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        
        # Calculate stats (guard against an empty sample, e.g. when target_size is 0)
        valid_ratio = sum(1 for c in sampled if c.original_label == "valid") / len(sampled) if sampled else 0.0
        
        return DatasetPartition(
            name=f"controlled_{strategy}",
            comments=sampled,
            stats=CleaningStats(
                original_size=len(original),
                cleaned_size=len(sampled),
                removed_size=len(original) - len(sampled),
                original_valid_ratio=sum(1 for c in original if c.original_label == "valid") / len(original),
                cleaned_valid_ratio=valid_ratio,
                reduction_percentage=(1 - len(sampled)/len(original)) * 100,
                improvement_percentage=0,  # No improvement expected
                processing_time=0,
                cost_estimate=0,
                confidence_distribution={}
            )
        )

# Create controlled datasets
controller = ControlledDatasetCreator()

# Sample from the same 100-comment subset that was cleaned, matching the cleaned size
cleaning_source = train_data[:100]

controlled_random = controller.create_controlled_dataset(
    cleaning_source,
    train_stats.cleaned_size,
    strategy="random"
)

controlled_stratified = controller.create_controlled_dataset(
    cleaning_source,
    train_stats.cleaned_size,
    strategy="stratified"
)

print("=== Controlled Dataset Statistics ===")
print(f"\nRandom sampling:")
print(f"  Size: {len(controlled_random.comments)}")
print(f"  Valid ratio: {controlled_random.stats.cleaned_valid_ratio:.1%}")

print(f"\nStratified sampling:")
print(f"  Size: {len(controlled_stratified.comments)}")
print(f"  Valid ratio: {controlled_stratified.stats.cleaned_valid_ratio:.1%}")

## 9. Simulating Model Training Impact

Let's simulate how different datasets affect model training (Section V-B).

In [None]:
class ModelTrainingSimulator:
    """Simulate the impact of data cleaning on model training"""
    
    def __init__(self, model_type: str = "CodeReviewer"):
        self.model_type = model_type
        self.training_history = []
        
    def simulate_training(self, dataset: DatasetPartition, epochs: int = 10) -> Dict:
        """
        Simulate training on a dataset.
        
        In real implementation, this would fine-tune actual models.
        Here we simulate based on paper's findings.
        """
        print(f"\nSimulating {self.model_type} training on {dataset.name}...")
        print(f"Dataset size: {len(dataset.comments)}")
        print(f"Valid ratio: {dataset.stats.cleaned_valid_ratio:.1%}")
        
        # Simulate training metrics based on data quality
        base_bleu = 5.73  # Original CodeReviewer BLEU-4 from paper
        
        # Quality affects final performance
        quality_factor = dataset.stats.cleaned_valid_ratio
        
        # Dataset size relative to 1,000 samples (kept for reference; the simulated curves below do not use it)
        size_factor = len(dataset.comments) / 1000
        
        history = {
            'epochs': [],
            'loss': [],
            'bleu': [],
            'val_bleu': []
        }
        
        for epoch in range(epochs):
            # Simulate loss decay
            loss = 2.0 * np.exp(-epoch * quality_factor * 0.3) + 0.2
            
            # Simulate BLEU improvement
            if 'cleaned' in dataset.name:
                # Cleaned data improves faster and reaches higher BLEU
                improvement = 0.13 * (1 - np.exp(-epoch * 0.5))  # Up to 13% improvement
            elif 'controlled' in dataset.name:
                # Controlled shows minimal improvement
                improvement = 0.02 * (1 - np.exp(-epoch * 0.3))
            else:
                # Original dataset baseline
                improvement = 0
            
            bleu = base_bleu * (1 + improvement)
            val_bleu = bleu * 0.95  # Validation slightly lower
            
            history['epochs'].append(epoch + 1)
            history['loss'].append(loss)
            history['bleu'].append(bleu) 
            history['val_bleu'].append(val_bleu)
            
            if epoch % 3 == 0:
                print(f"  Epoch {epoch+1}: Loss={loss:.3f}, BLEU={bleu:.2f}")
        
        # Final results
        final_results = {
            'dataset': dataset.name,
            'final_bleu': history['bleu'][-1],
            'improvement': (history['bleu'][-1] - base_bleu) / base_bleu * 100,
            'training_time': len(dataset.comments) * 0.01,  # Simulated time
            'history': history
        }
        
        self.training_history.append(final_results)
        return final_results
    
    def compare_results(self) -> pd.DataFrame:
        """Compare training results across datasets"""
        
        comparison = []
        for result in self.training_history:
            comparison.append({
                'Dataset': result['dataset'],
                'Final BLEU-4': f"{result['final_bleu']:.2f}",
                'Improvement': f"{result['improvement']:+.1f}%",
                'Training Time': f"{result['training_time']:.1f}h"
            })
        
        return pd.DataFrame(comparison)

# Simulate training on different datasets
trainer = ModelTrainingSimulator("CodeReviewer")

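# Train on original (baseline): the same 100-comment subset that was cleaned
original_subset = train_data[:100]
# Measure the subset's valid ratio instead of assuming the nominal 64%
original_valid_ratio = sum(1 for c in original_subset if c.original_label == "valid") / len(original_subset)

original_partition = DatasetPartition(
    name="original",
    comments=original_subset,
    stats=CleaningStats(
        original_size=len(original_subset),
        cleaned_size=len(original_subset),
        removed_size=0,
        original_valid_ratio=original_valid_ratio,
        cleaned_valid_ratio=original_valid_ratio,
        reduction_percentage=0,
        improvement_percentage=0,
        processing_time=0,
        cost_estimate=0
    )
)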

results_original = trainer.simulate_training(original_partition)
results_cleaned = trainer.simulate_training(cleaned_train)
results_controlled = trainer.simulate_training(controlled_random)

# Display comparison
print("\n=== Training Results Comparison ===")
comparison_df = trainer.compare_results()
print(comparison_df.to_string(index=False))
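
The "cleaned" branch of the simulator uses a saturating exponential, so the simulated BLEU-4 approaches the +13% ceiling without reaching it. The sketch below reproduces that curve in isolation. The constants are copied from `simulate_training`, while `simulated_bleu` is an illustrative helper rather than part of the notebook.

```python
import math

BASE_BLEU = 5.73  # baseline BLEU-4 used by the simulator
MAX_GAIN = 0.13   # ceiling of the relative gain for cleaned data
RATE = 0.5        # how quickly the gain saturates per epoch


def simulated_bleu(epoch: int) -> float:
    """BLEU-4 after `epoch` steps of the simulator's saturating-gain curve."""
    gain = MAX_GAIN * (1 - math.exp(-RATE * epoch))
    return BASE_BLEU * (1 + gain)


ceiling = BASE_BLEU * (1 + MAX_GAIN)
for epoch in (0, 3, 9):
    bleu = simulated_bleu(epoch)
    print(f"epoch index {epoch}: BLEU-4 = {bleu:.3f} (ceiling {ceiling:.3f})")
```

With the default 10 epochs (indices 0 to 9) the curve ends slightly below the ceiling, which is why the comparison table reports an improvement a little under 13% for the cleaned dataset.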

## 10. Visualizing Training Impact

Let's visualize how data cleaning affects training dynamics.

In [None]:
def visualize_training_comparison(training_history: List[Dict]):
    """Visualize training curves for different datasets"""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    colors = {'original': '#3498db', 'cleaned_gpt-3.5-turbo': '#2ecc71', 
              'controlled_random': '#e67e22'}
    
    # 1. Loss curves
    for result in training_history:
        color = colors.get(result['dataset'], '#95a5a6')
        axes[0].plot(result['history']['epochs'], 
                    result['history']['loss'],
                    label=result['dataset'].replace('_', ' ').title(),
                    color=color,
                    linewidth=2)
    
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Training Loss Curves')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # 2. BLEU improvement
    for result in training_history:
        color = colors.get(result['dataset'], '#95a5a6')
        axes[1].plot(result['history']['epochs'],
                    result['history']['bleu'],
                    label=result['dataset'].replace('_', ' ').title(),
                    color=color,
                    linewidth=2)
    
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('BLEU-4 Score')
    axes[1].set_title('BLEU-4 Score Evolution')
    axes[1].grid(True, alpha=0.3)
    
    # Add paper's reported improvements as reference lines
    axes[1].axhline(y=5.73, color='gray', linestyle='--', alpha=0.5, label='Original baseline')
    axes[1].axhline(y=6.47, color='green', linestyle='--', alpha=0.5, label='Paper: Cleaned (+13%)')
    # Build the legend last so the reference lines are included
    axes[1].legend()
    
    # 3. Final performance comparison
    datasets = [r['dataset'].replace('_', '\n') for r in training_history]
    final_bleus = [r['final_bleu'] for r in training_history]
    improvements = [r['improvement'] for r in training_history]
    
    bars = axes[2].bar(range(len(datasets)), final_bleus, 
                      color=[colors.get(r['dataset'], '#95a5a6') for r in training_history])
    
    # Add improvement percentages on bars
    for i, (bar, imp) in enumerate(zip(bars, improvements)):
        height = bar.get_height()
        axes[2].text(bar.get_x() + bar.get_width()/2, height + 0.05,
                    f'{imp:+.1f}%', ha='center', fontsize=10)
    
    axes[2].set_xticks(range(len(datasets)))
    axes[2].set_xticklabels(datasets)
    axes[2].set_ylabel('Final BLEU-4 Score')
    axes[2].set_title('Final Performance Comparison')
    axes[2].set_ylim(5, 7)
    
    plt.tight_layout()
    plt.show()

# Visualize training comparison
visualize_training_comparison(trainer.training_history)

## 11. Analyzing Different Cleaning Models

The paper tests two LLMs for cleaning: GPT-3.5 and Llama3. Let's compare their characteristics.

In [None]:
def compare_cleaning_models():
    """Compare different LLMs for data cleaning (Table II from paper)"""
    
    # Statistics from the paper
    cleaning_models = {
        'GPT-3.5': {
            'training_cleaned': 39625,
            'training_original': 117739,
            'reduction': 66.3,
            'precision_valid': 85.1,
            'recall_noisy': 88.8,
            'bleu_improvement': 13.0,
            'cost': 50,
            'time_hours': 39
        },
        'Llama3': {
            'training_cleaned': 87872,
            'training_original': 117739,
            'reduction': 25.4,
            'precision_valid': 75.3,
            'recall_noisy': 51.0,
            'bleu_improvement': 7.5,
            'cost': 0,  # Open source
            'time_hours': 15
        }
    }
    
    # Create comparison visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    models = list(cleaning_models.keys())
    
    # 1. Dataset reduction comparison
    reductions = [cleaning_models[m]['reduction'] for m in models]
    kept = [100 - r for r in reductions]
    
    x = np.arange(len(models))
    width = 0.35
    
    axes[0, 0].bar(x - width/2, kept, width, label='Kept', color='#2ecc71')
    axes[0, 0].bar(x + width/2, reductions, width, label='Removed', color='#e74c3c')
    axes[0, 0].set_ylabel('Percentage (%)')
    axes[0, 0].set_title('Dataset Size After Cleaning')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(models)
    axes[0, 0].legend()
    
    # 2. Classification performance
    metrics = ['Precision (Valid)', 'Recall (Noisy)']
    gpt_scores = [cleaning_models['GPT-3.5']['precision_valid'], 
                  cleaning_models['GPT-3.5']['recall_noisy']]
    llama_scores = [cleaning_models['Llama3']['precision_valid'],
                   cleaning_models['Llama3']['recall_noisy']]
    
    x = np.arange(len(metrics))
    axes[0, 1].bar(x - width/2, gpt_scores, width, label='GPT-3.5', color='#3498db')
    axes[0, 1].bar(x + width/2, llama_scores, width, label='Llama3', color='#9b59b6')
    axes[0, 1].set_ylabel('Score (%)')
    axes[0, 1].set_title('Classification Performance')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels(metrics)
    axes[0, 1].legend()
    axes[0, 1].set_ylim(0, 100)
    
    # 3. BLEU improvement vs dataset size
    sizes = [cleaning_models[m]['training_cleaned'] for m in models]
    improvements = [cleaning_models[m]['bleu_improvement'] for m in models]
    
    axes[1, 0].scatter(sizes, improvements, s=200, alpha=0.7)
    for i, model in enumerate(models):
        axes[1, 0].annotate(model, (sizes[i], improvements[i]), 
                           xytext=(5, 5), textcoords='offset points')
    
    axes[1, 0].set_xlabel('Cleaned Dataset Size')
    axes[1, 0].set_ylabel('BLEU-4 Improvement (%)')
    axes[1, 0].set_title('Performance vs Dataset Size Trade-off')
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Cost-efficiency analysis
    costs = [cleaning_models[m]['cost'] for m in models]
    times = [cleaning_models[m]['time_hours'] for m in models]
    
    # Normalize by improvement for efficiency metric
    cost_per_improvement = [c / i if i else 0 for c, i in zip(costs, improvements)]
    time_per_improvement = [t / i if i else 0 for t, i in zip(times, improvements)]
    
    x = np.arange(len(models))
    axes[1, 1].bar(x - width/2, cost_per_improvement, width, 
                  label='$/% improvement', color='#e74c3c')
    axes[1, 1].bar(x + width/2, time_per_improvement, width,
                  label='Hours/% improvement', color='#f39c12')
    axes[1, 1].set_ylabel('Efficiency Metric')
    axes[1, 1].set_title('Cleaning Efficiency Comparison')
    axes[1, 1].set_xticks(x)
    axes[1, 1].set_xticklabels(models)
    axes[1, 1].legend()
    
    plt.tight_layout()
    plt.show()
    
    # Summary table
    print("\n=== Cleaning Model Comparison Summary ===")
    comparison_data = []
    for model, stats in cleaning_models.items():
        comparison_data.append({
            'Model': model,
            'Dataset Reduction': f"{stats['reduction']:.1f}%",
            'Precision (Valid)': f"{stats['precision_valid']:.1f}%",
            'BLEU Improvement': f"{stats['bleu_improvement']:.1f}%",
            'Cost': f"${stats['cost']}" if stats['cost'] > 0 else "Free",
            'Time': f"{stats['time_hours']}h"
        })
    
    df = pd.DataFrame(comparison_data)
    print(df.to_string(index=False))

compare_cleaning_models()

## 12. Production Pipeline Implementation

Let's build a production-style cleaning pipeline based on the paper's findings. Data loading and saving are mocked here so the cell runs without real files.

In [None]:
class ProductionCleaningPipeline:
    """Production-ready data cleaning pipeline"""
    
    def __init__(self, 
                 cleaning_model: str = "gpt-3.5-turbo",
                 batch_size: int = 100,
                 save_intermediate: bool = True):
        
        self.cleaning_model = cleaning_model
        self.batch_size = batch_size
        self.save_intermediate = save_intermediate
        
        self.cleaner = SemanticDataCleaner(
            model_name=cleaning_model,
            batch_size=batch_size
        )
        
        self.cleaning_log = []
        
    def clean_full_dataset(self, 
                          train_path: str,
                          val_path: str,
                          output_dir: str) -> Dict:
        """
        Clean complete dataset (train + validation).
        
        Args:
            train_path: Path to training data
            val_path: Path to validation data  
            output_dir: Directory to save cleaned datasets
            
        Returns:
            Dictionary with cleaning results and statistics
        """
        
        print("Starting production cleaning pipeline...")
        print(f"Model: {self.cleaning_model}")
        print(f"Batch size: {self.batch_size}")
        
        results = {}
        
        # Clean training set
        print("\n1. Cleaning training dataset...")
        train_data = self._load_data(train_path)
        cleaned_train, removed_train, train_stats = self.cleaner.clean_dataset(train_data)
        
        if self.save_intermediate:
            self._save_partition(cleaned_train, f"{output_dir}/train_cleaned.json")
            self._save_partition(removed_train, f"{output_dir}/train_removed.json")
        
        results['train'] = {
            'cleaned': cleaned_train,
            'removed': removed_train,
            'stats': train_stats
        }
        
        # Clean validation set
        print("\n2. Cleaning validation dataset...")
        val_data = self._load_data(val_path)
        cleaned_val, removed_val, val_stats = self.cleaner.clean_dataset(val_data)
        
        if self.save_intermediate:
            self._save_partition(cleaned_val, f"{output_dir}/val_cleaned.json")
            self._save_partition(removed_val, f"{output_dir}/val_removed.json")
        
        results['val'] = {
            'cleaned': cleaned_val,
            'removed': removed_val,
            'stats': val_stats
        }
        
        # Generate summary report
        self._generate_report(results, f"{output_dir}/cleaning_report.txt")
        
        # Log results
        self.cleaning_log.append({
            'timestamp': datetime.now(),
            'model': self.cleaning_model,
            'results': results
        })
        
        return results
    
    def _load_data(self, path: str) -> List[ReviewComment]:
        """Load data from file (mock: ignores `path` and returns synthetic comments)"""
        # In production, replace this with actual loading from `path`
        return generate_mock_dataset(100)
    
    def _save_partition(self, partition: DatasetPartition, path: str):
        """Save partition to file"""
        # In production, implement actual saving logic
        print(f"  Saved {len(partition.comments)} comments to {path}")
    
    def _generate_report(self, results: Dict, path: str):
        """Generate comprehensive cleaning report"""
        
        report = []
        report.append("=" * 60)
        report.append("DATA CLEANING REPORT")
        report.append("=" * 60)
        report.append(f"\nGenerated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        report.append(f"Model: {self.cleaning_model}")
        report.append("\nSUMMARY")
        report.append("-" * 30)
        
        total_original = 0
        total_cleaned = 0
        
        for split, data in results.items():
            stats = data['stats']
            total_original += stats.original_size
            total_cleaned += stats.cleaned_size
            
            report.append(f"\n{split.upper()} SET:")
            report.append(f"  Original: {stats.original_size:,} comments")
            report.append(f"  Cleaned: {stats.cleaned_size:,} comments ({stats.reduction_percentage:.1f}% reduction)")
            report.append(f"  Valid ratio: {stats.original_valid_ratio:.1%} → {stats.cleaned_valid_ratio:.1%}")
            report.append(f"  Processing time: {stats.processing_time:.1f}s")
        
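        report.append("\nTOTAL:")
        report.append(f"  Original: {total_original:,} comments")
        report.append(f"  Cleaned: {total_cleaned:,} comments")
        # Guard against an empty dataset to avoid ZeroDivisionError
        overall_reduction = (1 - total_cleaned / total_original) * 100 if total_original else 0.0
        report.append(f"  Overall reduction: {overall_reduction:.1f}%")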
        
        report.append("\nEXPECTED IMPROVEMENTS (based on paper):")
        report.append("  BLEU-4: +7.5% to +13.0%")
        report.append("  Information score: +24%")
        report.append("  Relevance score: +11%")
        
        report_text = "\n".join(report)
        print("\n" + report_text)
        
        # In production, save to file
        # with open(path, 'w') as f:
        #     f.write(report_text)

# Demonstrate production pipeline
pipeline = ProductionCleaningPipeline(
    cleaning_model="gpt-3.5-turbo",
    batch_size=50
)

# Run cleaning (mock paths for demo)
results = pipeline.clean_full_dataset(
    train_path="/path/to/train.json",
    val_path="/path/to/val.json",
    output_dir="/path/to/output"
)
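
The `_load_data` and `_save_partition` methods above are mocks. A minimal sketch of file-backed versions might look like the following, assuming each split is stored as a JSON array of comment objects. The `CommentRecord` dataclass and its fields (`comment_id`, `text`, `is_valid`) are hypothetical stand-ins, since the real `ReviewComment` fields may differ.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List, Optional


@dataclass
class CommentRecord:
    """Hypothetical stand-in for ReviewComment; field names are assumptions."""
    comment_id: str
    text: str
    is_valid: Optional[bool] = None


def load_comments(path: str) -> List[CommentRecord]:
    """Read a JSON array of comment objects from disk."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return [CommentRecord(**item) for item in raw]


def save_comments(comments: List[CommentRecord], path: str) -> None:
    """Write comments as a JSON array, creating the output directory if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", encoding="utf-8") as f:
        json.dump([asdict(c) for c in comments], f, ensure_ascii=False, indent=2)
```

Creating the parent directory inside `save_comments` means a call such as `save_comments(cleaned, f"{output_dir}/train_cleaned.json")` does not fail when `output_dir` does not exist yet.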

## 13. Key Insights and Best Practices

### Main Findings from the Paper:

1. **Quality > Quantity**: A 66% smaller dataset (GPT-3.5 cleaning) achieves a 13% higher BLEU-4 score
2. **Valid Ratio Improvement**: From 64% to 85% with GPT-3.5
3. **Cost Efficiency**: $50 vs $25,600 for manual cleaning
4. **Time Efficiency**: 39 hours vs 2,000+ hours manually
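
The percentages above follow directly from the figures quoted in this notebook. As a quick sanity check, the short sketch below recomputes the dataset reductions from the training set sizes in Section 1.1 and the manual vs. automated ratios from the cost and time figures. The helper name `reduction_pct` is introduced here for illustration only.

```python
# Figures quoted in this notebook (Section 1.1 and the findings above).
ORIGINAL_TRAIN = 117_739
CLEANED_TRAIN = {"GPT-3.5": 39_625, "Llama3": 87_872}


def reduction_pct(original: int, cleaned: int) -> float:
    """Percentage of samples removed by cleaning."""
    return (1 - cleaned / original) * 100


for model, size in CLEANED_TRAIN.items():
    print(f"{model}: {reduction_pct(ORIGINAL_TRAIN, size):.1f}% reduction")

# Manual vs. GPT-3.5 cleaning. 2,000 hours is the lower bound quoted
# above ("2,000+"), so the time ratio is a lower bound as well.
cost_ratio = 25_600 / 50
time_ratio = 2_000 / 39
print(f"Manual cleaning: {cost_ratio:.0f}x the cost, at least {time_ratio:.0f}x the time")
```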

### Best Practices for Production:

1. **Choose the Right Model**:
   - GPT-3.5: Best precision (85.1%) but higher cost
   - Llama3: Good balance of performance and cost (free)

2. **Optimize Batch Processing**:
   - Use parallel processing for large datasets
   - Save intermediate results for recovery

3. **Quality Control**:
   - Monitor confidence scores
   - Sample and manually verify borderline cases
   - Track cleaning statistics over time

4. **Iterative Improvement**:
   - Start with aggressive cleaning (high precision)
   - Gradually adjust thresholds based on results
   - Consider ensemble approaches for critical applications
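
The advice on saving intermediate results can be sketched as a checkpointed batch loop. This is a minimal, sequential illustration rather than the pipeline's actual implementation: `classify_with_checkpoints` and the `classify_batch` callback are hypothetical names, and the checkpoint is assumed to be a plain JSON list of labels in an existing directory.

```python
import json
from pathlib import Path
from typing import Callable, List


def classify_with_checkpoints(
    comments: List[str],
    classify_batch: Callable[[List[str]], List[bool]],
    checkpoint_path: str,
    batch_size: int = 100,
) -> List[bool]:
    """Classify comments batch by batch, persisting labels after each batch.

    If checkpoint_path already exists, processing resumes after the last
    saved label instead of starting over.
    """
    ckpt = Path(checkpoint_path)
    labels: List[bool] = json.loads(ckpt.read_text()) if ckpt.exists() else []

    # Start from the first comment that has no saved label yet.
    for start in range(len(labels), len(comments), batch_size):
        batch = comments[start:start + batch_size]
        labels.extend(classify_batch(batch))
        ckpt.write_text(json.dumps(labels))  # intermediate result for recovery

    return labels
```

If a run is interrupted (for example by an API rate limit), calling the function again with the same `checkpoint_path` skips the comments that were already labeled. Parallelism could be layered on top by giving each worker its own slice of the data and its own checkpoint file.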

### Future Directions:

1. **Active Learning**: Focus manual review on uncertain cases
2. **Domain Adaptation**: Fine-tune classifiers for specific projects
3. **Multi-stage Cleaning**: Combine multiple models for better results
4. **Continuous Monitoring**: Track model performance drift over time
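
One simple way to combine the active-learning and multi-stage ideas is to run several classifiers and send only the comments they disagree on to manual review. The sketch below assumes each classifier returns a boolean "valid" label per comment. `triage_by_agreement` and the example votes are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def triage_by_agreement(
    votes: Dict[str, List[bool]],
) -> Tuple[List[int], List[int], List[int]]:
    """Split comment indices into keep / remove / review buckets.

    votes maps a classifier name to its per-comment "valid" predictions.
    Unanimous "valid" -> keep, unanimous "noisy" -> remove,
    any disagreement -> manual review.
    """
    n = len(next(iter(votes.values())))
    if any(len(v) != n for v in votes.values()):
        raise ValueError("all classifiers must label the same comments")

    keep, remove, review = [], [], []
    for i in range(n):
        valid_votes = sum(v[i] for v in votes.values())
        if valid_votes == len(votes):
            keep.append(i)
        elif valid_votes == 0:
            remove.append(i)
        else:
            review.append(i)
    return keep, remove, review


# Illustrative predictions only; not real model output.
votes = {
    "gpt-3.5": [True, False, True, False],
    "llama3": [True, False, False, True],
}
keep, remove, review = triage_by_agreement(votes)
print(f"keep={keep} remove={remove} review={review}")
```

Routing only the disagreements to human reviewers keeps manual effort proportional to the number of uncertain cases rather than to the full dataset size.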

This semantic data cleaning pipeline demonstrates that thoughtful data curation can improve model performance while reducing computational costs: a win-win for production ML systems.