# Focused Learning: Topic Modeling for Quality Evaluation

## Learning Objectives

This notebook walks through the paper's semi-automated approach to evaluating comment quality at scale with topic modeling. It shows how BERTopic and semantic clustering make it possible to assess thousands of generated comments without reviewing each one manually.

**What you'll learn:**
1. How to use topic modeling to cluster similar code review comments
2. The theory behind BERTopic and why it outperforms traditional methods
3. How to evaluate information and relevance scores efficiently
4. Practical implementation of semi-automated quality assessment

**Paper Reference**: Section VI-C - Overall Evaluation using Topic Modeling (RQ3)

## 1. Theoretical Foundation

### 1.1 The Challenge of Quality Evaluation at Scale

The paper identifies a critical challenge: manually evaluating thousands of generated comments is infeasible. Traditional metrics like BLEU only measure lexical similarity, not actual quality.

**Quality Dimensions** (Section VI-A):
1. **Information Score (1-5)**: How informative is the comment for code revision?
   - 5: Explicitly points out issues with concrete suggestions
   - 1: Purely seeks clarification without feedback

2. **Relevance Score (1-3)**: How related is the comment to the code change?
   - 3: Explicitly references specific code elements
   - 1: Generic or off-topic, with no clear connection to the change
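
The two scales can be folded into a single number by normalising each to [0, 1] and taking a weighted sum. The sketch below uses the 0.7 / 0.3 weighting that the scorer in Section 4 of this notebook applies. Those weights are a notebook choice rather than values taken from the paper, and `combined_quality` is a hypothetical helper name.

```python
def combined_quality(information: float, relevance: float,
                     w_info: float = 0.7, w_rel: float = 0.3) -> float:
    """Normalise both rubric scores to [0, 1] and blend them."""
    if not 1 <= information <= 5:
        raise ValueError("information score must be in [1, 5]")
    if not 1 <= relevance <= 3:
        raise ValueError("relevance score must be in [1, 3]")
    return w_info * (information / 5) + w_rel * (relevance / 3)


best = combined_quality(5, 3)    # top of both rubrics
worst = combined_quality(1, 1)   # bottom of both rubrics
print(f"best={best:.2f}, worst={worst:.2f}")
```

Because the lowest rubric values are 1 rather than 0, the combined score never drops below 0.24 with these weights.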

### 1.2 Why Topic Modeling?

Topic modeling enables:
- **Clustering similar comments** → Evaluate representatives instead of all
- **Semantic grouping** → Comments with similar intent grouped together
- **Scalable evaluation** → 50 clusters instead of 10,000+ individual comments
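
To make the scaling argument concrete, the arithmetic can be sketched as follows. The figure of three representatives per cluster is the default used later in this notebook; `review_effort` is a hypothetical helper introduced only for illustration.

```python
def review_effort(n_comments: int, n_clusters: int, reps_per_cluster: int) -> dict:
    """Compare manual reviews needed with and without clustering."""
    manual_reviews = n_clusters * reps_per_cluster
    return {
        "manual_reviews": manual_reviews,
        "fraction_of_corpus": manual_reviews / n_comments,
        "reviews_saved": n_comments - manual_reviews,
    }


effort = review_effort(n_comments=10_000, n_clusters=50, reps_per_cluster=3)
print(effort)
```

The saving only holds if comments inside a cluster really are similar in quality, which is why topic coherence matters later on.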

### 1.3 BERTopic vs Traditional Methods

The paper chooses BERTopic over LDA because:
1. **Contextual embeddings**: Uses transformer models for better semantic understanding
2. **Dynamic clustering**: Doesn't require pre-specifying number of topics
3. **Code-aware**: Can use code-specific embedding models (CodeT5+)
4. **High coherence**: Produces more interpretable topics
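
Interpretable topic labels in BERTopic come from a class-based TF-IDF (c-TF-IDF) step applied after clustering: all documents in a cluster are merged, and words are weighted by how specific they are to that cluster. The following is a simplified pure-Python sketch of that weighting, not BERTopic's actual implementation; the cluster names and comments are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(clusters: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each word per cluster with a simplified class-based TF-IDF."""
    # Treat all comments in a cluster as one long document
    counts = {
        name: Counter(word for doc in docs for word in doc.lower().split())
        for name, docs in clusters.items()
    }
    # Frequency of each word across all clusters
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    # Average number of words per cluster
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for name, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[name] = {
            # normalised term frequency * log(1 + average words / global frequency)
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in cluster_counts.items()
        }
    return scores


clusters = {
    "null_checks": ["add null check for user", "add null check before calling getName"],
    "naming": ["rename userId to userIdentifier", "rename variable for clarity"],
}
scores = c_tf_idf(clusters)
for name, word_scores in scores.items():
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    print(name, top)
```

Words that are frequent inside one cluster but rare elsewhere (here, `rename`) rise to the top, which is what makes the resulting topic labels readable.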

## 2. Environment Setup

In [None]:
import re
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Optional
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Topic modeling
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# For embeddings and clustering
import umap
import hdbscan
from sklearn.decomposition import PCA

# For coherence calculation
import gensim
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Visualization
from wordcloud import WordCloud
import plotly.graph_objects as go
import plotly.express as px

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

print("Environment setup complete!")

## 3. Understanding Quality Metrics

Let's implement the information and relevance scoring systems from the paper.

In [None]:
class QualityMetrics:
    """Implementation of quality metrics from Section VI-A"""
    
    @staticmethod
    def information_score_rubric() -> Dict[int, Dict]:
        """Information score rubric (1-5 scale)"""
        return {
            5: {
                "description": "Very informative with concrete suggestions",
                "examples": [
                    "This method violates SRP. Extract validation logic into validateUser() method",
                    "Replace magic number 86400 with constant SECONDS_PER_DAY for clarity",
                    "Add try-catch to handle potential NullPointerException when user is null"
                ],
                "characteristics": [
                    "Identifies specific issue",
                    "Provides concrete solution",
                    "Explains reasoning"
                ]
            },
            4: {
                "description": "Informative with clear direction",
                "examples": [
                    "Consider using dependency injection here",
                    "This could be simplified with a ternary operator",
                    "Add error handling for edge cases"
                ],
                "characteristics": [
                    "Clear suggestion",
                    "Actionable feedback",
                    "May lack specifics"
                ]
            },
            3: {
                "description": "Moderately informative",
                "examples": [
                    "This method is too long",
                    "Consider refactoring this",
                    "Add documentation"
                ],
                "characteristics": [
                    "General guidance",
                    "Lacks specific solution",
                    "Still actionable"
                ]
            },
            2: {
                "description": "Minimally informative",
                "examples": [
                    "This looks complex",
                    "Not sure about this approach",
                    "Could be better"
                ],
                "characteristics": [
                    "Vague feedback",
                    "No clear action",
                    "Opinion without solution"
                ]
            },
            1: {
                "description": "Not informative (seeks clarification)",
                "examples": [
                    "Why do we need this?",
                    "What does this do?",
                    "???"
                ],
                "characteristics": [
                    "Questions without suggestions",
                    "No actionable feedback",
                    "Purely clarification"
                ]
            }
        }
    
    @staticmethod
    def relevance_score_rubric() -> Dict[int, Dict]:
        """Relevance score rubric (1-3 scale)"""
        return {
            3: {
                "description": "Highly relevant - explicitly references code",
                "examples": [
                    "The variable 'userId' should be renamed to 'userIdentifier'",
                    "Line 45: Add null check before calling user.getName()",
                    "The getUserData() method needs error handling"
                ],
                "characteristics": [
                    "References specific code elements",
                    "Mentions variable/method names",
                    "May include line numbers"
                ]
            },
            2: {
                "description": "Moderately relevant - implicitly related",
                "examples": [
                    "Add error handling here",
                    "This needs documentation",
                    "Consider performance impact"
                ],
                "characteristics": [
                    "General reference to code area",
                    "No specific code elements",
                    "Still contextually appropriate"
                ]
            },
            1: {
                "description": "Low relevance - generic or off-topic",
                "examples": [
                    "Follow coding standards",
                    "This could be improved",
                    "LGTM"
                ],
                "characteristics": [
                    "Generic feedback",
                    "Could apply to any code",
                    "No clear connection to diff"
                ]
            }
        }

# Display rubrics
metrics = QualityMetrics()

print("=== Information Score Rubric ===")
for score, details in metrics.information_score_rubric().items():
    print(f"\nScore {score}: {details['description']}")
    print(f"Example: '{details['examples'][0]}'")

## 4. Implementing Quality Scorer

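Let's build a keyword-based scorer that roughly approximates manual evaluation. It is a heuristic stand-in for human judgment, so treat its scores as indicative rather than exact.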

In [None]:
class AutomatedQualityScorer:
    """Automated scoring based on comment characteristics"""
    
    def __init__(self):
        # Keywords indicating different quality levels
        self.action_keywords = [
            'should', 'consider', 'suggest', 'recommend', 'must',
            'add', 'remove', 'extract', 'refactor', 'rename',
            'replace', 'use', 'implement', 'fix', 'update'
        ]
        
        self.concrete_keywords = [
            'method', 'function', 'variable', 'class', 'constant',
            'parameter', 'return', 'exception', 'import', 'package'
        ]
        
        self.vague_keywords = [
            'maybe', 'perhaps', 'not sure', 'hmm', 'weird',
            'strange', 'confusing', 'unclear', 'complicated'
        ]
        
        self.question_keywords = [
            'why', 'what', 'how', 'when', 'where', 'which',
            '?', 'clarify', 'explain'
        ]
    
    def score_information(self, comment: str) -> int:
        """Score comment informativeness (1-5)"""
        comment_lower = comment.lower()
        score = 1  # Base score
        
        # Check for action words (+1)
        if any(word in comment_lower for word in self.action_keywords):
            score += 1
        
        # Check for concrete technical terms (+1)
        if any(word in comment_lower for word in self.concrete_keywords):
            score += 1
        
        # Check for specific suggestions (+1)
        if ('instead' in comment_lower or 'rather than' in comment_lower or 
            '->' in comment or '=>' in comment):
            score += 1
        
        # Check for reasoning (+1)
        if any(word in comment_lower for word in ['because', 'since', 'improve', 'better', 'cleaner']):
            score += 1
        
        # Penalize vague language (-1)
        if any(word in comment_lower for word in self.vague_keywords):
            score = max(1, score - 1)
        
        # Penalize pure questions (-1)
        if comment.strip().endswith('?') and len(comment.split()) < 10:
            score = max(1, score - 1)
        
        return min(5, score)
    
    def score_relevance(self, comment: str, code_diff: Optional[str] = None) -> int:
        """Score comment relevance to code (1-3)"""
        score = 1  # Base score
        
        if not code_diff:
            # Without code diff, use heuristics
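            # Match explicit line references such as "Line 45" or "L45".
            # A plain substring test for 'L' would also fire on "LGTM".
            if re.search(r'\b[Ll]ine\s*\d+|\bL\d+\b', comment):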
                score = 3
            elif re.search(r'\b\w+\(\)', comment):  # Method calls
                score = 3
            elif re.search(r'\$\w+|\w+\$', comment):  # Variable references
                score = 3
            elif any(word in comment.lower() for word in self.concrete_keywords):
                score = 2
        else:
            # With code diff, check overlap
            code_tokens = set(re.findall(r'\b\w+\b', code_diff))
            comment_tokens = set(re.findall(r'\b\w+\b', comment))
            
            overlap = len(code_tokens.intersection(comment_tokens))
            if overlap >= 3:
                score = 3
            elif overlap >= 1:
                score = 2
        
        return score
    
    def score_batch(self, comments: List[Dict]) -> pd.DataFrame:
        """Score a batch of comments"""
        results = []
        
        for item in comments:
            comment = item['comment']
            code_diff = item.get('code_diff', None)
            
            info_score = self.score_information(comment)
            rel_score = self.score_relevance(comment, code_diff)
            
            results.append({
                'comment': comment,
                'information_score': info_score,
                'relevance_score': rel_score,
                'quality_score': (info_score / 5 * 0.7) + (rel_score / 3 * 0.3)  # Weighted
            })
        
        return pd.DataFrame(results)

# Test the scorer
import re
scorer = AutomatedQualityScorer()

test_comments = [
    {"comment": "Why do we have this flag?"},
    {"comment": "Consider extracting this logic into a separate validateUser() method for better modularity"},
    {"comment": "The variable userId should be renamed to userIdentifier for consistency"},
    {"comment": "What does this do?"},
    {"comment": "Add null check before accessing user.getName() to prevent NPE"}
]

scores_df = scorer.score_batch(test_comments)
print("Sample Scoring Results:")
print(scores_df.to_string(index=False, max_colwidth=50))
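
One limitation of the scorer above is that it matches keywords with a substring test on the lowercased comment, so short keywords can fire inside unrelated words. The sketch below contrasts substring and whole-word matching on a small, hypothetical keyword list; the helper names are illustrative only.

```python
import re
from typing import List

ACTION_KEYWORDS = ["use", "add", "fix"]  # small subset, for illustration


def substring_hits(comment: str, keywords: List[str]) -> List[str]:
    """Keywords found anywhere in the text, even inside longer words."""
    text = comment.lower()
    return [k for k in keywords if k in text]


def whole_word_hits(comment: str, keywords: List[str]) -> List[str]:
    """Keywords that appear as complete words."""
    words = set(re.findall(r"[a-z_]+", comment.lower()))
    return [k for k in keywords if k in words]


comment = "Not sure, because the user address looks odd"
print(substring_hits(comment, ACTION_KEYWORDS))   # ['use', 'add']
print(whole_word_hits(comment, ACTION_KEYWORDS))  # []
```

Whole-word matching removes these false positives but also misses inflected forms such as "extracting", so a stemmer or a prefix pattern like `\bextract\w*` may be a better fit depending on the keyword.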

## 5. Creating Mock Generated Comments Dataset

Let's create datasets simulating comments from original and cleaned models.

In [None]:
def generate_mock_comments(model_type: str, n_comments: int = 1000) -> List[Dict]:
    """Generate mock comments simulating different model outputs"""
    
    # Templates based on model type (reflecting paper's findings)
    if model_type == "original":
        # Original model: mix of valid and noisy patterns
        templates = [
            # Noisy (40%)
            "Why this change?",
            "What is the purpose of this?",
            "I don't understand this",
            "Is this necessary?",
            "???",
            "Not sure about this",
            "This looks weird",
            "hmm",
            
            # Semi-valid (30%)
            "Consider refactoring",
            "This could be improved",
            "Add documentation",
            "Check error handling",
            "Review this logic",
            
            # Valid (30%)
            "Extract this into a separate method",
            "Use a constant instead of magic number {}",
            "Add null check for {}",
            "Rename {} to {} for clarity",
            "This violates single responsibility principle"
        ]
        weights = [0.05] * 8 + [0.06] * 5 + [0.06] * 5
        
    elif model_type == "cleaned":
        # Cleaned model: mostly valid, specific suggestions
        templates = [
            # High quality (60%)
            "Extract {} logic into separate {} method for better modularity",
            "Replace magic number {} with constant {} for maintainability",
            "Add try-catch to handle {} when {} is null",
            "Rename variable {} to {} to follow naming conventions",
            "This method violates SRP. Split into {} and {}",
            "Use dependency injection instead of creating {} directly",
            "Consider using {} pattern here for better extensibility",
            "Add unit tests to cover the edge case when {} is {}",
            
            # Medium quality (30%)
            "This could be simplified using {}",
            "Consider caching {} for performance",
            "Add validation for {} parameter",
            "Document the purpose of this {}",
            
            # Low quality (10%)
            "Refactor this method",
            "Improve naming"
        ]
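        weights = [0.075] * 8 + [0.075] * 4 + [0.05] * 2
    
    else:
        # Fail early instead of hitting a NameError on `templates` below
        raise ValueError(f"Unknown model_type: {model_type!r} (expected 'original' or 'cleaned')")
    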
    # Generate comments
    comments = []
    
    # Common fill values
    variables = ['userId', 'userData', 'config', 'result', 'response']
    methods = ['validate', 'process', 'calculate', 'fetch', 'update']
    patterns = ['Factory', 'Observer', 'Strategy', 'Singleton']
    numbers = ['3600', '86400', '1000', '256', '42']
    
    for i in range(n_comments):
        template = np.random.choice(templates, p=weights)
        
        # Fill template with random values
        if '{}' in template:
            n_placeholders = template.count('{}')
            if 'magic number' in template:
                values = [np.random.choice(numbers), 
                         f"{np.random.choice(['MAX_', 'DEFAULT_', ''])}{np.random.choice(['TIMEOUT', 'SIZE', 'COUNT'])}"]
            elif 'variable' in template or 'Rename' in template:
                values = [np.random.choice(variables), 
                         np.random.choice(variables) + np.random.choice(['Id', 'Data', 'Info'])]  
            else:
                values = [np.random.choice(variables + methods + patterns) 
                         for _ in range(n_placeholders)]
            
            comment = template.format(*values[:n_placeholders])
        else:
            comment = template
        
        # Add mock code diff
        code_diff = f"+ {np.random.choice(variables)} = {np.random.choice(methods)}();"
        
        comments.append({
            'id': f'comment_{i}',
            'comment': comment,
            'code_diff': code_diff,
            'model': model_type
        })
    
    return comments

# Generate datasets
original_comments = generate_mock_comments('original', 500)
cleaned_comments = generate_mock_comments('cleaned', 500)

print(f"Generated {len(original_comments)} comments from original model")
print(f"Generated {len(cleaned_comments)} comments from cleaned model")

# Sample comments
print("\nSample from original model:")
for c in original_comments[:3]:
    print(f"  - {c['comment']}")
    
print("\nSample from cleaned model:")  
for c in cleaned_comments[:3]:
    print(f"  - {c['comment']}")
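
A quick sanity check on the sampling weights can catch mistakes in the intended category mix. The sketch below copies the group sizes and per-template weights from the `original` branch above and confirms both the expected and the empirically sampled proportions; it uses the standard library's `random.choices` rather than NumPy so that it runs on its own.

```python
import random
from collections import Counter

# (number of templates, weight per template), copied from the 'original' branch
groups = {"noisy": (8, 0.05), "semi_valid": (5, 0.06), "valid": (5, 0.06)}

labels, weights = [], []
for name, (n_templates, weight) in groups.items():
    labels += [name] * n_templates
    weights += [weight] * n_templates

# Expected share of each category is templates * weight
expected = {name: n * w for name, (n, w) in groups.items()}
total_weight = sum(weights)

# Empirical share from seeded sampling
n_draws = 10_000
rng = random.Random(42)
draws = Counter(rng.choices(labels, weights=weights, k=n_draws))
observed = {name: draws[name] / n_draws for name in groups}

print("total weight:", round(total_weight, 6))
for name in groups:
    print(f"{name}: expected {expected[name]:.2f}, observed {observed[name]:.3f}")
```

The weights must sum to 1 for `np.random.choice(..., p=weights)` to accept them, so this check is worth rerunning whenever templates are added or removed.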

## 6. Implementing BERTopic for Comment Clustering

Now let's implement the topic modeling approach from Section VI-C.

In [None]:
class CommentTopicModeler:
    """Topic modeling for code review comments using BERTopic"""
    
    def __init__(self, 
                 embedding_model: str = 'all-MiniLM-L6-v2',
                 n_topics: int = 50,
                 min_topic_size: int = 10):
        """
        Initialize topic modeler.
        
        Paper uses:
        - CodeT5+ for embeddings (we'll use all-MiniLM-L6-v2 as substitute)
        - Agglomerative clustering
        - 50 topics for ~10k comments
        """
        
        # Embedding model
        self.sentence_model = SentenceTransformer(embedding_model)
        
        # Clustering model (paper uses agglomerative)
        self.cluster_model = AgglomerativeClustering(
            n_clusters=n_topics,
            linkage='ward'
        )
        
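        # Initialize BERTopic.
        # Note: BERTopic accepts a scikit-learn style clusterer through the
        # `hdbscan_model` argument. With a non-HDBSCAN clusterer such as
        # AgglomerativeClustering, `min_topic_size` and `calculate_probabilities`
        # are expected to have no effect, and fit_transform may return None
        # for the probabilities.
        self.topic_model = BERTopic(
            embedding_model=self.sentence_model,
            hdbscan_model=self.cluster_model,
            nr_topics=n_topics,
            min_topic_size=min_topic_size,
            calculate_probabilities=True,
            verbose=True
        )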
        
        self.topics = None
        self.probs = None
        self.topic_info = None
        
    def fit_transform(self, comments: List[str]) -> Tuple:
        """Fit topic model and transform comments"""
        
        print(f"Fitting topic model on {len(comments)} comments...")
        self.topics, self.probs = self.topic_model.fit_transform(comments)
        
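        # Get topic information
        self.topic_info = self.topic_model.get_topic_info()
        
        # Count real topics; a -1 outlier row only exists for HDBSCAN-style clusterers
        n_found = int((self.topic_info.Topic != -1).sum())
        print(f"Found {n_found} topics (excluding outliers)")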
        
        return self.topics, self.probs
    
    def calculate_coherence(self, comments: List[str]) -> float:
        """Calculate topic coherence score"""
        
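        # Get topics and their words (iterate over real topic ids, skipping outliers)
        topics_words = []
        for topic_id in self.topic_info.Topic:
            if topic_id != -1:  # Skip outlier topic
                words = self.topic_model.get_topic(int(topic_id))
                # BERTopic pads short topics with empty strings; drop them
                top_words = [word for word, _ in words[:10] if word]
                if top_words:
                    topics_words.append(top_words)
        
        # Tokenize documents the same way as BERTopic's default CountVectorizer
        # (lowercase, tokens of 2+ word characters) so that every topic word
        # appears in the gensim dictionary
        tokenized_docs = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in comments]
        
        # Create dictionary
        dictionary = Dictionary(tokenized_docs)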
        
        # Calculate coherence
        coherence_model = CoherenceModel(
            topics=topics_words,
            texts=tokenized_docs,
            dictionary=dictionary,
            coherence='c_v'
        )
        
        return coherence_model.get_coherence()
    
    def get_representative_comments(self, comments: List[str], n_per_topic: int = 3) -> Dict:
        """Get representative comments for each topic"""
        
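        representatives = {}
        
        # Iterate over the actual topic ids. Custom cluster models such as
        # AgglomerativeClustering produce no -1 outlier topic, so
        # range(len(topic_info) - 1) would silently skip the last topic.
        for topic_id in map(int, self.topic_info.Topic):
            if topic_id == -1:
                continue
                
            # Get indices of comments in this topic
            topic_indices = [i for i, t in enumerate(self.topics) if t == topic_id]
            
            if not topic_indices:
                continue
            
            # Get probabilities for these comments. BERTopic may return None
            # (clusterer without probabilities) or a 1-D array, so handle both.
            probs = None if self.probs is None else np.asarray(self.probs)
            if probs is None:
                topic_probs = [0.0] * len(topic_indices)  # falls back to document order
            elif probs.ndim == 1:
                topic_probs = [float(probs[i]) for i in topic_indices]
            else:
                topic_probs = [float(probs[i][topic_id]) if topic_id < probs.shape[1] else 0.0
                               for i in topic_indices]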
            
            # Sort by probability and get top N
            sorted_indices = sorted(zip(topic_indices, topic_probs), 
                                  key=lambda x: x[1], reverse=True)
            
            # Get representative comments
            reps = []
            for idx, prob in sorted_indices[:n_per_topic]:
                reps.append({
                    'comment': comments[idx],
                    'probability': prob,
                    'index': idx
                })
            
            representatives[topic_id] = reps
        
        return representatives
    
    def visualize_topics(self, comments: List[str]):
        """Create visualizations of topic distribution"""
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 1. Topic size distribution
        topic_sizes = self.topic_info[self.topic_info.Topic != -1]['Count'].values
        axes[0, 0].bar(range(len(topic_sizes)), topic_sizes)
        axes[0, 0].set_xlabel('Topic ID')
        axes[0, 0].set_ylabel('Number of Comments')
        axes[0, 0].set_title('Topic Size Distribution')
        
        # 2. Top topics
        top_topics = self.topic_info[self.topic_info.Topic != -1].nlargest(10, 'Count')
        axes[0, 1].barh(top_topics['Name'], top_topics['Count'])
        axes[0, 1].set_xlabel('Number of Comments')
        axes[0, 1].set_title('Top 10 Topics')
        
        # 3. Word cloud of top topic
        if len(self.topic_info) > 1:
            top_topic_words = self.topic_model.get_topic(0)
            word_freq = {word: score for word, score in top_topic_words[:20]}
            
            wordcloud = WordCloud(width=400, height=400, 
                                 background_color='white').generate_from_frequencies(word_freq)
            
            axes[1, 0].imshow(wordcloud, interpolation='bilinear')
            axes[1, 0].axis('off')
            axes[1, 0].set_title('Top Topic Word Cloud')
        
        # 4. Coherence by topic
        # Simplified visualization - in practice, calculate per-topic coherence
        axes[1, 1].text(0.5, 0.5, f'Overall Coherence\n{self.calculate_coherence(comments):.3f}', 
                       ha='center', va='center', fontsize=24)
        axes[1, 1].set_xlim(0, 1)
        axes[1, 1].set_ylim(0, 1)
        axes[1, 1].axis('off')
        axes[1, 1].set_title('Model Coherence Score')
        
        plt.tight_layout()
        plt.show()

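# Apply topic modeling to our mock comments
all_comments_text = [c['comment'] for c in original_comments + cleaned_comments]

# Demo subset: the same 100 + 100 comments that Section 7 evaluates, so that
# topic assignments line up index-for-index with the comments being scored
# and both models are represented in the topics
demo_comments_text = [c['comment'] for c in original_comments[:100] + cleaned_comments[:100]]

# Initialize and fit model
topic_modeler = CommentTopicModeler(n_topics=20)  # Fewer topics for demo
topics, probs = topic_modeler.fit_transform(demo_comments_text)

# Visualize results
topic_modeler.visualize_topics(demo_comments_text)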
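
The class above ranks candidate representatives by topic probability. When a clusterer provides no probabilities, one alternative is to rank cluster members by cosine similarity to the cluster centroid in embedding space. Below is a minimal sketch with toy 2-D vectors standing in for real embeddings; `representatives_by_centroid` is a hypothetical helper, not part of BERTopic, and this is not claimed to be the paper's selection rule.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def representatives_by_centroid(embeddings: List[Sequence[float]],
                                labels: List[int],
                                n_per_cluster: int = 3) -> Dict[int, List[int]]:
    """Return, per cluster, the indices of the members closest to its centroid."""
    members: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    reps = {}
    for label, indices in members.items():
        center = centroid([embeddings[i] for i in indices])
        ranked = sorted(indices, key=lambda i: cosine(embeddings[i], center), reverse=True)
        reps[label] = ranked[:n_per_cluster]
    return reps


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5],   # cluster 0
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]   # cluster 1
labels = [0, 0, 0, 1, 1, 1]
reps = representatives_by_centroid(embeddings, labels, n_per_cluster=1)
print(reps)
```

With real data, `embeddings` would come from the sentence-embedding model and `labels` from the fitted clusterer.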

## 7. Quality Evaluation via Topic Representatives

Now let's implement the semi-automated quality evaluation approach.

In [None]:
class TopicBasedQualityEvaluator:
    """Evaluate quality using topic modeling approach from paper"""
    
    def __init__(self, topic_modeler: CommentTopicModeler, quality_scorer: AutomatedQualityScorer):
        self.topic_modeler = topic_modeler
        self.quality_scorer = quality_scorer
        
    def evaluate_representatives(self, 
                               comments_data: List[Dict],
                               n_representatives: int = 3) -> pd.DataFrame:
        """
        Evaluate quality by scoring topic representatives.
        
        This implements the approach from Section VI-C:
        1. Get top 3 representatives per topic
        2. Score them for information and relevance
        3. Use average as topic score
        4. Apply to all comments in topic
        """
        
        comments_text = [c['comment'] for c in comments_data]
        
        # Get representatives
        representatives = self.topic_modeler.get_representative_comments(
            comments_text, 
            n_per_topic=n_representatives
        )
        
        # Score representatives and calculate topic scores
        topic_scores = {}
        
        for topic_id, reps in representatives.items():
            info_scores = []
            rel_scores = []
            
            for rep in reps:
                idx = rep['index']
                comment_data = comments_data[idx]
                
                info_score = self.quality_scorer.score_information(comment_data['comment'])
                rel_score = self.quality_scorer.score_relevance(
                    comment_data['comment'], 
                    comment_data.get('code_diff')
                )
                
                info_scores.append(info_score)
                rel_scores.append(rel_score)
            
            # Average scores for topic
            topic_scores[topic_id] = {
                'avg_information': np.mean(info_scores),
                'avg_relevance': np.mean(rel_scores),
                'n_comments': sum(1 for t in self.topic_modeler.topics if t == topic_id)
            }
        
        # Apply topic scores to all comments
        results = []
        
        for i, (comment_data, topic) in enumerate(zip(comments_data, self.topic_modeler.topics)):
            if topic in topic_scores:
                info_score = topic_scores[topic]['avg_information']
                rel_score = topic_scores[topic]['avg_relevance']
            else:
                # Outliers - score individually
                info_score = self.quality_scorer.score_information(comment_data['comment'])
                rel_score = self.quality_scorer.score_relevance(
                    comment_data['comment'],
                    comment_data.get('code_diff')
                )
            
            results.append({
                'comment': comment_data['comment'][:50] + '...',
                'model': comment_data.get('model', 'unknown'),
                'topic': topic,
                'information_score': info_score,
                'relevance_score': rel_score,
                'quality_score': (info_score / 5 * 0.7) + (rel_score / 3 * 0.3)
            })
        
        return pd.DataFrame(results)
    
    def compare_model_quality(self, results_df: pd.DataFrame) -> Dict:
        """Compare quality between models"""
        
        comparison = {}
        
        for model in results_df['model'].unique():
            model_data = results_df[results_df['model'] == model]
            
            comparison[model] = {
                'avg_information': model_data['information_score'].mean(),
                'avg_relevance': model_data['relevance_score'].mean(),
                'avg_quality': model_data['quality_score'].mean(),
                
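                # Distribution of scores. Topic-level scores are averages over the
                # representatives, so round them back onto the rubric scale first.
                'info_distribution': model_data['information_score'].round().astype(int).value_counts().to_dict(),
                'rel_distribution': model_data['relevance_score'].round().astype(int).value_counts().to_dict(),
                
                # Low quality percentage (rounded info <= 2 or rounded rel == 1)
                'low_info_pct': (model_data['information_score'].round() <= 2).mean() * 100,
                'low_rel_pct': (model_data['relevance_score'].round() == 1).mean() * 100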
            }
        
        return comparison

# Evaluate quality using topic modeling
evaluator = TopicBasedQualityEvaluator(topic_modeler, scorer)

# Combine datasets for evaluation
all_comments_data = original_comments[:100] + cleaned_comments[:100]

# Run evaluation
print("Evaluating comment quality via topic representatives...")
quality_results = evaluator.evaluate_representatives(all_comments_data)

# Compare models
model_comparison = evaluator.compare_model_quality(quality_results)

print("\n=== Model Quality Comparison ===")
# Use `stats` as the loop variable so the `metrics` object from Section 3 is not overwritten
for model, stats in model_comparison.items():
    print(f"\n{model.upper()} Model:")
    print(f"  Average Information Score: {stats['avg_information']:.2f}/5")
    print(f"  Average Relevance Score: {stats['avg_relevance']:.2f}/3")
    print(f"  Low Quality Comments: {stats['low_info_pct']:.1f}% (info), {stats['low_rel_pct']:.1f}% (rel)")
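
The four steps in the `evaluate_representatives` docstring (pick representatives, score them, average per topic, apply to every member) can be isolated in a few lines. The sketch below uses made-up topic assignments and scores, and a hypothetical `propagate_scores` helper, to show how few items need an actual review.

```python
from statistics import mean
from typing import Callable, Dict, List


def propagate_scores(topics: List[int],
                     representatives: Dict[int, List[int]],
                     score_fn: Callable[[int], float]) -> List[float]:
    """Score only representatives, then give every comment its topic's mean score."""
    topic_score = {
        topic: mean(score_fn(idx) for idx in indices)
        for topic, indices in representatives.items()
    }
    # Comments whose topic has no representatives (e.g. outliers) are scored directly
    return [
        topic_score[topic] if topic in topic_score else score_fn(idx)
        for idx, topic in enumerate(topics)
    ]


# Toy data: six comments, two topics, one outlier (-1)
topics = [0, 0, 0, 1, 1, -1]
representatives = {0: [0, 1], 1: [3]}
manual_scores = {0: 4, 1: 5, 3: 2, 5: 1}  # only these indices are ever reviewed

reviewed = []


def review(idx: int) -> float:
    """Stand-in for a human rating; records which comments were looked at."""
    reviewed.append(idx)
    return manual_scores[idx]


scores = propagate_scores(topics, representatives, review)
print(scores)                 # [4.5, 4.5, 4.5, 2, 2, 1]
print(sorted(set(reviewed)))  # [0, 1, 3, 5]
```

Four of the six comments are reviewed here; on a real corpus with large clusters the reviewed fraction is far smaller, at the cost of assuming that cluster members share their representatives' quality.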

## 8. Visualizing Quality Improvements

Let's create visualizations matching Figure 4 from the paper.

In [None]:
def visualize_quality_improvements(quality_results: pd.DataFrame, model_comparison: Dict):
    """Create visualizations similar to Figure 4 in the paper"""
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Colors for models
    colors = {'original': '#e74c3c', 'cleaned': '#2ecc71'}
    
    # 1. Information score distribution
    for model in ['original', 'cleaned']:
        model_data = quality_results[quality_results['model'] == model]
        info_counts = model_data['information_score'].value_counts().sort_index()
        
        x = info_counts.index
        y = info_counts.values
        
        axes[0, 0].bar(x + (0.2 if model == 'cleaned' else -0.2), y, 
                      width=0.4, label=model.capitalize(), 
                      color=colors[model], alpha=0.7)
    
    axes[0, 0].set_xlabel('Information Score')
    axes[0, 0].set_ylabel('Number of Comments')
    axes[0, 0].set_title('Information Score Distribution')
    axes[0, 0].legend()
    axes[0, 0].set_xticks(range(1, 6))
    
    # 2. Relevance score distribution
    for model in ['original', 'cleaned']:
        model_data = quality_results[quality_results['model'] == model]
        rel_counts = model_data['relevance_score'].value_counts().sort_index()
        
        x = rel_counts.index
        y = rel_counts.values
        
        axes[0, 1].bar(x + (0.1 if model == 'cleaned' else -0.1), y,
                      width=0.2, label=model.capitalize(),
                      color=colors[model], alpha=0.7)
    
    axes[0, 1].set_xlabel('Relevance Score')
    axes[0, 1].set_ylabel('Number of Comments')
    axes[0, 1].set_title('Relevance Score Distribution')
    axes[0, 1].legend()
    axes[0, 1].set_xticks(range(1, 4))
    
    # 3. Average scores comparison
    models = list(model_comparison.keys())
    info_scores = [model_comparison[m]['avg_information'] for m in models]
    rel_scores = [model_comparison[m]['avg_relevance'] for m in models]
    
    x = np.arange(len(models))
    width = 0.35
    
    bars1 = axes[1, 0].bar(x - width/2, info_scores, width, 
                           label='Information', color='#3498db')
    bars2 = axes[1, 0].bar(x + width/2, [r * 5/3 for r in rel_scores], width,
                           label='Relevance (scaled)', color='#9b59b6')
    
    axes[1, 0].set_ylabel('Average Score')
    axes[1, 0].set_title('Average Quality Scores by Model')
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels([m.capitalize() for m in models])
    axes[1, 0].legend()
    axes[1, 0].set_ylim(0, 5)
    
    # Add value labels
    for bars, scores in [(bars1, info_scores), (bars2, [r * 5/3 for r in rel_scores])]:
        for bar, score in zip(bars, scores):
            axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                           f'{score:.2f}', ha='center', fontsize=10)
    
    # 4. Improvement percentages (matching paper's findings)
    improvements = {
        'Information': 24,  # From paper
        'Relevance': 11,    # From paper  
        'Low Info Reduction': 73,  # From paper
        'Low Rel Reduction': 61    # From paper
    }
    
    categories = list(improvements.keys())
    values = list(improvements.values())
    
    bars = axes[1, 1].bar(categories, values, color=['#2ecc71', '#3498db', '#e74c3c', '#f39c12'])
    axes[1, 1].set_ylabel('Improvement (%)')
    axes[1, 1].set_title('Quality Improvements (Paper Results)')
    # Rotate the existing category labels instead of re-setting them via set_xticklabels
    plt.setp(axes[1, 1].get_xticklabels(), rotation=45, ha='right')
    
    # Add percentage labels
    for bar, val in zip(bars, values):
        axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                       f'{val}%', ha='center')
    
    plt.tight_layout()
    plt.show()

# Create visualizations
visualize_quality_improvements(quality_results, model_comparison)

## 9. Interactive Topic Explorer

Let's build a summary view for exploring topics and their quality: per-topic score aggregates, the highest- and lowest-quality topics, and example comments from each.

In [None]:
def create_topic_quality_dashboard(topic_modeler: CommentTopicModeler, 
                                 quality_results: pd.DataFrame,
                                 comments_data: List[Dict]):
    """Create interactive dashboard for topic exploration"""
    
    # Calculate topic-level quality metrics
    topic_quality = quality_results.groupby('topic').agg({
        'information_score': ['mean', 'std', 'count'],
        'relevance_score': ['mean', 'std'],
        'quality_score': 'mean'
    }).round(2)
    
    # Flatten column names
    topic_quality.columns = ['_'.join(col).strip() for col in topic_quality.columns.values]
    topic_quality = topic_quality.reset_index()
    
    # Add topic names
    topic_names = []
    for topic_id in topic_quality['topic']:
        if topic_id == -1:
            topic_names.append('Outliers')
        else:
            # Get top words for topic
            words = topic_modeler.topic_model.get_topic(topic_id)[:3]
            topic_name = ', '.join([w[0] for w in words])
            topic_names.append(f"Topic {topic_id}: {topic_name}")
    
    topic_quality['topic_name'] = topic_names
    
    # Display top and bottom quality topics
    print("=== Topic Quality Analysis ===")
    print("\nTop 5 Highest Quality Topics:")
    top_topics = topic_quality.nlargest(5, 'quality_score_mean')
    print(top_topics[['topic_name', 'information_score_mean', 
                     'relevance_score_mean', 'information_score_count']].to_string(index=False))
    
    print("\nBottom 5 Lowest Quality Topics:")
    bottom_topics = topic_quality.nsmallest(5, 'quality_score_mean')
    print(bottom_topics[['topic_name', 'information_score_mean',
                        'relevance_score_mean', 'information_score_count']].to_string(index=False))
    
    # Example comments from best and worst topics
    print("\n=== Example Comments ===")
    
    best_topic = top_topics.iloc[0]['topic']
    worst_topic = bottom_topics.iloc[0]['topic']
    
    print(f"\nBest Topic ({top_topics.iloc[0]['topic_name']}):")
    best_examples = [comments_data[i] for i, t in enumerate(topic_modeler.topics[:len(comments_data)]) 
                    if t == best_topic][:3]
    for ex in best_examples:
        print(f"  - {ex['comment']}")
    
    print(f"\nWorst Topic ({bottom_topics.iloc[0]['topic_name']}):")
    worst_examples = [comments_data[i] for i, t in enumerate(topic_modeler.topics[:len(comments_data)])
                     if t == worst_topic][:3]
    for ex in worst_examples:
        print(f"  - {ex['comment']}")
    
    return topic_quality

# Create dashboard
topic_quality_df = create_topic_quality_dashboard(
    topic_modeler,
    quality_results,
    all_comments_data
)

## 10. Implementing the Complete Evaluation Pipeline

Let's combine all components into a single end-to-end pipeline: topic modeling, representative scoring, model comparison, and report generation.

In [None]:
class ProductionQualityEvaluator:
    """Complete pipeline for quality evaluation using topic modeling"""
    
    def __init__(self,
                 embedding_model: str = 'all-MiniLM-L6-v2',
                 n_topics: int = 50,
                 n_representatives: int = 3):
        
        self.embedding_model = embedding_model
        self.n_topics = n_topics
        self.n_representatives = n_representatives
        
        # Initialize components
        self.scorer = AutomatedQualityScorer()
        self.topic_modeler = None
        self.evaluator = None
        
    def evaluate_model_outputs(self,
                             original_comments: List[Dict],
                             cleaned_comments: List[Dict]) -> Dict:
        """
        Complete evaluation pipeline:
        1. Combine datasets
        2. Fit topic model
        3. Evaluate representatives
        4. Calculate improvements
        5. Generate report
        """
        
        print("=== Starting Quality Evaluation Pipeline ===")
        
        # Step 1: Prepare data
        print("\n1. Preparing datasets...")
        all_comments = original_comments + cleaned_comments
        all_text = [c['comment'] for c in all_comments]
        
        print(f"   Total comments: {len(all_comments)}")
        print(f"   Original model: {len(original_comments)}")
        print(f"   Cleaned model: {len(cleaned_comments)}")
        
        # Step 2: Topic modeling
        print("\n2. Fitting topic model...")
        self.topic_modeler = CommentTopicModeler(
            embedding_model=self.embedding_model,
            n_topics=self.n_topics
        )
        topics, probs = self.topic_modeler.fit_transform(all_text)
        
        # Calculate coherence
        coherence = self.topic_modeler.calculate_coherence(all_text)
        print(f"   Topic coherence: {coherence:.3f}")
        
        # Step 3: Quality evaluation
        print("\n3. Evaluating quality via representatives...")
        self.evaluator = TopicBasedQualityEvaluator(
            self.topic_modeler,
            self.scorer
        )
        
        quality_results = self.evaluator.evaluate_representatives(
            all_comments,
            n_representatives=self.n_representatives
        )
        
        # Step 4: Calculate improvements
        print("\n4. Calculating improvements...")
        model_comparison = self.evaluator.compare_model_quality(quality_results)
        
        # Calculate percentage improvements
        orig_metrics = model_comparison['original']
        clean_metrics = model_comparison['cleaned']

        def pct_of_baseline(delta: float, baseline: float) -> float:
            """Express delta as a percentage of baseline; 0.0 when the baseline is 0."""
            return delta / baseline * 100 if baseline > 0 else 0.0

        improvements = {
            # Score improvements: (cleaned - original) relative to original
            'information_improvement': pct_of_baseline(
                clean_metrics['avg_information'] - orig_metrics['avg_information'],
                orig_metrics['avg_information']
            ),
            'relevance_improvement': pct_of_baseline(
                clean_metrics['avg_relevance'] - orig_metrics['avg_relevance'],
                orig_metrics['avg_relevance']
            ),
            # Reductions: (original - cleaned) relative to original, so positive is better
            'low_info_reduction': pct_of_baseline(
                orig_metrics['low_info_pct'] - clean_metrics['low_info_pct'],
                orig_metrics['low_info_pct']
            ),
            'low_rel_reduction': pct_of_baseline(
                orig_metrics['low_rel_pct'] - clean_metrics['low_rel_pct'],
                orig_metrics['low_rel_pct']
            )
        }
        
        # Step 5: Generate report
        print("\n5. Generating evaluation report...")
        report = self._generate_report(
            model_comparison,
            improvements,
            coherence,
            quality_results
        )
        
        return {
            'quality_results': quality_results,
            'model_comparison': model_comparison,
            'improvements': improvements,
            'coherence': coherence,
            'report': report
        }
    
    def _generate_report(self, 
                        model_comparison: Dict,
                        improvements: Dict,
                        coherence: float,
                        quality_results: pd.DataFrame) -> str:
        """Generate comprehensive evaluation report"""
        
        report_lines = [
            "=" * 60,
            "QUALITY EVALUATION REPORT",
            "=" * 60,
            f"\nEvaluation Method: Topic Modeling with {self.n_topics} topics",
            f"Embedding Model: {self.embedding_model}",
            f"Representatives per Topic: {self.n_representatives}",
            f"Topic Coherence Score: {coherence:.3f}",
            
            "\n" + "=" * 60,
            "MODEL COMPARISON",
            "=" * 60,
        ]
        
        for model, metrics in model_comparison.items():
            report_lines.extend([
                f"\n{model.upper()} MODEL:",
                f"  Average Information Score: {metrics['avg_information']:.2f}/5.0",
                f"  Average Relevance Score: {metrics['avg_relevance']:.2f}/3.0",
                f"  Overall Quality Score: {metrics['avg_quality']:.3f}",
                f"  Low Quality Comments:",
                f"    - Low Information (≤2): {metrics['low_info_pct']:.1f}%",
                f"    - Low Relevance (=1): {metrics['low_rel_pct']:.1f}%"
            ])
        
        report_lines.extend([
            "\n" + "=" * 60,
            "IMPROVEMENTS",
            "=" * 60,
            f"\nInformation Score: {improvements['information_improvement']:+.1f}%",
            f"Relevance Score: {improvements['relevance_improvement']:+.1f}%",
            f"Low Information Reduction: {improvements['low_info_reduction']:.1f}%",
            f"Low Relevance Reduction: {improvements['low_rel_reduction']:.1f}%",

            "\n" + "=" * 60,
            "COMPARISON WITH PAPER RESULTS",
            "=" * 60,
            "\nPaper reported improvements:",
            "  - Information: +24%",
            "  - Relevance: +11%",
            "  - Low Info Reduction: 73-80%",
            "  - Low Rel Reduction: 61-72%",

            "\n" + "=" * 60,
            "CONCLUSION",
            "=" * 60,
        ])

        # Base the conclusion on the measured values rather than asserting it
        if all(value > 0 for value in improvements.values()):
            report_lines.extend([
                "\nThe cleaned model improves comment quality on every metric",
                "measured in this run, consistent with the paper's finding that",
                "semantic data cleaning benefits code review comment generation."
            ])
        else:
            report_lines.extend([
                "\nThe cleaned model does not improve on every metric in this run;",
                "compare the values above with the paper's results before drawing",
                "conclusions about the effect of semantic data cleaning."
            ])
        
        report = "\n".join(report_lines)
        print("\n" + report)
        
        return report

# Run complete evaluation
production_evaluator = ProductionQualityEvaluator(n_topics=10)  # Fewer topics for demo

evaluation_results = production_evaluator.evaluate_model_outputs(
    original_comments[:100],
    cleaned_comments[:100]
)

## 11. Key Insights and Best Practices

### Main Findings from the Paper:

1. **Topic Modeling Enables Scale**: Evaluate 50 topics instead of 10,000+ comments
2. **High Coherence (0.67+)**: Well-formed clusters enable reliable evaluation
3. **Quality Improvements**:
   - Information: +24%
   - Relevance: +11%
   - Low quality reduction: 73-80%
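
The percentages above are relative changes against the original model's value. As a minimal sketch of that arithmetic (the function name `relative_change_pct` and the input numbers are hypothetical, chosen only for illustration and not taken from the paper):

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is 0")
    return (after - before) / before * 100.0


# Hypothetical average information scores on the 1-5 scale
print(f"{relative_change_pct(2.0, 2.5):+.1f}%")    # +25.0%

# Hypothetical share of low-quality comments dropping from 20% to 5%;
# a "reduction" is the negated relative change
print(f"{-relative_change_pct(20.0, 5.0):.1f}%")   # 75.0%
```

Note that a reduction in the *share* of low-quality comments is itself a percentage of a percentage, which is why it can be much larger than the change in average score.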

### Best Practices for Implementation:

1. **Choose the Right Embedding Model**:
   - Use code-aware models (CodeT5+, CodeBERT)
   - Fine-tune on your domain if possible

2. **Optimize Topic Count**:
   - ~50 topics per 10k comments
   - Prefer higher coherence over a larger topic count

3. **Representative Selection**:
   - Top 3 by probability
   - Manually verify borderline topics

4. **Quality Metrics**:
   - Weight information higher (70%)
   - Consider domain-specific rubrics
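
Two of these practices can be sketched in a few lines of plain Python: picking the top-k representatives per topic by assignment probability, and combining the two scores with a 70/30 weighting. The function names are hypothetical, and the min-max normalization of each score to the 0-1 range is an assumption made here for illustration; the rubric only fixes the 1-5 and 1-3 scales.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_k_representatives(topics: List[int],
                          probs: List[float],
                          k: int = 3) -> Dict[int, List[int]]:
    """Per topic, return the indices of the k comments with the highest probability."""
    by_topic: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for idx, (topic, prob) in enumerate(zip(topics, probs)):
        if topic == -1:  # -1 marks outliers, which have no meaningful representative
            continue
        by_topic[topic].append((prob, idx))
    return {
        topic: [idx for _, idx in sorted(members, key=lambda m: -m[0])[:k]]
        for topic, members in by_topic.items()
    }


def weighted_quality(information: float,
                     relevance: float,
                     info_weight: float = 0.7) -> float:
    """Combine both scores into one value in [0, 1], weighting information higher."""
    info_norm = (information - 1) / 4   # assumed mapping: 1-5 scale -> 0-1
    rel_norm = (relevance - 1) / 2      # assumed mapping: 1-3 scale -> 0-1
    return info_weight * info_norm + (1 - info_weight) * rel_norm


# Toy assignments: four comments in topic 0, one in topic 1, one outlier
topics = [0, 0, 0, 0, 1, -1]
probs = [0.2, 0.9, 0.5, 0.7, 0.4, 0.99]
print(top_k_representatives(topics, probs))   # {0: [1, 3, 2], 1: [4]}
print(f"{weighted_quality(4, 3):.3f}")        # 0.825
```

Only the selected representatives need a rubric score; every other comment in the topic inherits the topic-level result.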

### Limitations and Considerations:

1. **Approximation**: A topic's average score can hide outlier comments within the cluster
2. **Coherence Dependency**: Low coherence → unreliable evaluation
3. **Manual Validation**: Spot checks on representatives are still needed
4. **Domain Specificity**: Rubrics may need adjustment per project
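
One way to decide where to spend manual spot checks is to flag topics whose representatives disagree with each other. The sketch below is an illustration under stated assumptions, not part of the paper's method: the function name `topics_needing_review` and the threshold of 1.0 score points are hypothetical, and the outlier topic (-1) is always flagged because it is not a coherent cluster.

```python
from collections import defaultdict
from statistics import pstdev
from typing import Dict, List, Tuple


def topics_needing_review(scored: List[Tuple[int, float]],
                          max_std: float = 1.0) -> List[int]:
    """Return topic ids whose representative scores spread more than max_std.

    `scored` holds (topic_id, information_score) pairs, one per representative.
    """
    by_topic: Dict[int, List[float]] = defaultdict(list)
    for topic, score in scored:
        by_topic[topic].append(score)

    flagged = []
    for topic, scores in sorted(by_topic.items()):
        spread = pstdev(scores) if len(scores) > 1 else 0.0
        if topic == -1 or spread > max_std:
            flagged.append(topic)
    return flagged


# Toy scores: topic 0 is consistent, topic 1 ranges from 1 to 5
scored = [(0, 4), (0, 4), (0, 5), (1, 1), (1, 5), (1, 3), (-1, 2)]
print(topics_needing_review(scored))   # [-1, 1]
```

A flagged topic is a candidate for reviewing more than the default number of representatives before its average is trusted.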

This approach provides a scalable, semi-automated method for evaluating the quality of generated code review comments. In the paper's evaluation, it shows that models trained on cleaned datasets produce higher-quality comments.