# Focused Learning: Dataset Quality Filtering with LLM Classifiers

## Learning Objective
Master the pairwise comparison technique for dataset quality filtering using LLM classifiers, as described in Section III.B of the Kotlin ML Pack paper.

## Paper Reference
- **Section**: III.B - KStack-clean: Learning the code quality
- **Key Insight**: Small curated datasets (25K examples) can outperform large uncurated ones (4M files)
- **Technique**: LLM-based pairwise comparison for quality assessment

## 1. Theoretical Foundation

### 1.1 The Problem
From the paper: "Using curated datasets to fine-tune a model can provide larger improvements than fine-tuning it on a bigger corpus of non-curated data."

### 1.2 The Solution: Pairwise Quality Scoring

The paper scores each file with a formula that is symmetric in the order of presentation:

$$s(f) = \frac{(s(f,c)_A - s(f,c)_B) + (s(c,f)_B - s(c,f)_A)}{2}$$

Where:
- $f$ is the file being evaluated
- $c$ is a comparison file
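- $s(f,c)_A$ and $s(f,c)_B$ are the probabilities that the model answers A or B when $f$ is labeled A and $c$ is labeled B; $s(c,f)_A$ and $s(c,f)_B$ are the same probabilities with the labels swapped
- Evaluating both orderings accounts for ordering bias, and $s(f)$ ranges from $-1$ ($c$ always preferred) to $1$ ($f$ always preferred)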
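
To make the formula concrete, here is a minimal sketch that evaluates it on hand-picked probabilities. The helper names `probs_from_logprobs` and `pairwise_score` and all numbers are illustrative assumptions, not taken from the paper; the renormalization over the two answer tokens is one plausible way to turn log-probabilities into $s(\cdot)_A$ and $s(\cdot)_B$.

```python
import math
from typing import Tuple

def probs_from_logprobs(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Renormalize the log-probabilities of answers 'A' and 'B' so they sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    ea, eb = math.exp(logprob_a - m), math.exp(logprob_b - m)
    return ea / (ea + eb), eb / (ea + eb)

def pairwise_score(p_fc_a: float, p_fc_b: float, p_cf_a: float, p_cf_b: float) -> float:
    """s(f) for one comparison file c; f is in slot A for (f, c) and in slot B for (c, f)."""
    return ((p_fc_a - p_fc_b) + (p_cf_b - p_cf_a)) / 2

# Case 1: the model prefers slot A (70/30) regardless of content.
biased = pairwise_score(0.7, 0.3, 0.7, 0.3)

# Case 2: f is preferred in both orderings (as A first, then as B).
f_better = pairwise_score(0.9, 0.1, 0.2, 0.8)

# Case 3: start from raw log-probabilities for the first ordering.
p_a, p_b = probs_from_logprobs(math.log(0.6), math.log(0.2))
from_logprobs = pairwise_score(p_a, p_b, 0.5, 0.5)

print(f"{biased:.2f} {f_better:.2f} {from_logprobs:.2f}")
```

In the first case the two bracketed terms cancel, so a pure slot preference contributes nothing to the score; in the second, both terms are positive and the score moves toward 1.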

### 1.3 Three-Pass Approximation
Since comparing all pairs takes $O(n^2)$ comparisons, the paper approximates the ranking with three passes over the dataset:
1. **Pass 1**: Compare random sample against dataset
2. **Pass 2**: Compare highest-scored file from Pass 1 against dataset
3. **Pass 3**: Compare highest-scored file from Pass 2 against dataset
4. **Final Score**: Average of Pass 2 and Pass 3 scores
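
The passes above can be sketched on a toy problem. This sketch assumes a hypothetical judge whose probability of answering A is a logistic function of a made-up latent quality gap plus a fixed bonus for slot A; `p_choose_a`, `slot_bias`, and the `quality` values are illustrative assumptions and do not come from the paper.

```python
import math
import random

def p_choose_a(q_a: float, q_b: float, slot_bias: float = 0.5) -> float:
    """Hypothetical judge: probability of answering 'A', with a logit bonus for slot A."""
    return 1.0 / (1.0 + math.exp(-(q_a - q_b + slot_bias)))

def score(q_f: float, q_c: float) -> float:
    """Bidirectional score of f against c, in [-1, 1]."""
    p_fc = p_choose_a(q_f, q_c)   # f shown as A
    p_cf = p_choose_a(q_c, q_f)   # c shown as A
    return ((p_fc - (1 - p_fc)) + ((1 - p_cf) - p_cf)) / 2

def one_pass(quality: dict, ref_id: str) -> dict:
    """Score every file against a single reference file."""
    return {fid: score(q, quality[ref_id]) for fid, q in quality.items()}

def three_pass(quality: dict, rng: random.Random) -> dict:
    ref1 = rng.choice(sorted(quality))   # Pass 1: random reference
    s1 = one_pass(quality, ref1)
    ref2 = max(s1, key=s1.get)           # Pass 2: best file from pass 1
    s2 = one_pass(quality, ref2)
    ref3 = max(s2, key=s2.get)           # Pass 3: best file from pass 2
    s3 = one_pass(quality, ref3)
    # Final score: average of passes 2 and 3
    return {fid: (s2[fid] + s3[fid]) / 2 for fid in quality}

rng = random.Random(0)
quality = {f"file_{i}": rng.uniform(-2, 2) for i in range(20)}  # made-up latent quality
final = three_pass(quality, rng)

ranked_by_score = sorted(final, key=final.get, reverse=True)
ranked_by_truth = sorted(quality, key=quality.get, reverse=True)
print(ranked_by_score == ranked_by_truth)
```

Because this toy judge is noise-free, the score is monotone in the latent quality and passes 2 and 3 select the same reference. With a real, noisy judge the passes can differ, which is where averaging them matters.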

In [None]:
# Install required packages
!pip install langchain langchain-openai numpy pandas matplotlib seaborn scikit-learn scipy tqdm

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
from dataclasses import dataclass
import random
from tqdm import tqdm

# LangChain imports (current package layout; the old langchain.chat_models path is deprecated)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 2. Implementation of Pairwise Quality Classifier

Let's implement the exact algorithm from the paper with detailed explanations.

In [None]:
@dataclass
class CodeFile:
    """Represents a code file for quality assessment"""
    id: str
    content: str
    score: float = 0.0
    comparisons: int = 0

class PairwiseQualityClassifier:
    """
    Implements the pairwise quality classification from Section III.B.
    
    Key innovation: Uses bidirectional comparison to eliminate ordering bias.
    """
    
    def __init__(self, model_name: str = "gpt-3.5-turbo", use_mock: bool = True):
        """
        Args:
            model_name: LLM to use for comparison
            use_mock: If True, use mock responses for demonstration
        """
        self.use_mock = use_mock
        if not use_mock:
            self.llm = ChatOpenAI(model_name=model_name, temperature=0)
        
        # Prompt based on the paper's criterion. Templates are passed as (role, text)
        # tuples so that {file_a} and {file_b} are substituted by format_messages;
        # ready-made SystemMessage/HumanMessage objects would be sent verbatim.
        self.prompt_template = ChatPromptTemplate.from_messages([
            ("system", "You are evaluating Kotlin code quality."),
            ("human", """Compare these two Kotlin code files and determine which has 
greater educational value for learning algorithms in Kotlin.

File A:
{file_a}

File B:
{file_b}

Which file (A or B) has higher educational value? Respond with only 'A' or 'B'.""")
        ])
    
    def _mock_compare(self, file_a: str, file_b: str) -> str:
        """Mock comparison based on code quality heuristics"""
        # Simple heuristics for demonstration
        score_a = (
            len(file_a.split('\n'))  # More lines
            + file_a.count('fun ')    # More functions
            + file_a.count('//')      # More comments
            + file_a.count('class ')  # More classes
        )
        score_b = (
            len(file_b.split('\n'))
            + file_b.count('fun ')
            + file_b.count('//')
            + file_b.count('class ')
        )
        # Add some randomness to simulate LLM uncertainty
        score_a += random.gauss(0, 2)
        score_b += random.gauss(0, 2)
        
        return 'A' if score_a > score_b else 'B'
    
    def compare_files(self, file_a: str, file_b: str) -> str:
        """Compare two files and return which is better ('A' or 'B')"""
        if self.use_mock:
            return self._mock_compare(file_a, file_b)
        else:
            messages = self.prompt_template.format_messages(file_a=file_a, file_b=file_b)
            response = self.llm.invoke(messages)
            return response.content.strip()
    
    def calculate_bidirectional_score(self, file_f: str, file_c: str) -> float:
        """
        Calculate an order-debiased score for file_f against file_c.
        
        Paper formula: s(f) = [(s(f,c)_A - s(f,c)_B) + (s(c,f)_B - s(c,f)_A)] / 2
        
        Here hard 'A'/'B' answers stand in for the probabilities, and the result is
        rescaled from [-1, 1] to [0, 1]: 1.0 = f wins both orderings, 0.5 = split
        decision, 0.0 = f loses both.
        """
        # First comparison: f as A, c as B
        result1 = self.compare_files(file_f, file_c)
        score_fc = 1.0 if result1 == 'A' else 0.0
        
        # Second comparison: c as A, f as B (reversed)
        result2 = self.compare_files(file_c, file_f)
        score_cf = 1.0 if result2 == 'B' else 0.0
        
        # Average the two orderings; equals (paper score + 1) / 2 for hard labels
        bidirectional_score = (score_fc + score_cf) / 2
        
        return bidirectional_score

## 3. Three-Pass Scoring Algorithm

The paper's key insight: instead of $O(n^2)$ comparisons, use three strategic passes.

In [None]:
class ThreePassScorer:
    """
    Implements the three-pass approximation algorithm from Section III.B.
    
    This reduces the cost from O(n²) to O(n) comparisons (3n bidirectional ones) by approximating the full ranking.
    """
    
    def __init__(self, classifier: PairwiseQualityClassifier):
        self.classifier = classifier
        self.pass_results = []
    
    def score_against_reference(self, files: List[CodeFile], reference: CodeFile) -> Dict[str, float]:
        """
        Score all files against a reference file.
        
        This is one 'pass' in the three-pass algorithm.
        """
        scores = {}
        
        for file in tqdm(files, desc=f"Scoring against {reference.id}"):
            if file.id == reference.id:
                scores[file.id] = 0.5  # Neutral score for self-comparison
            else:
                score = self.classifier.calculate_bidirectional_score(
                    file.content, 
                    reference.content
                )
                scores[file.id] = score
                file.comparisons += 1
        
        return scores
    
    def three_pass_scoring(self, files: List[CodeFile], sample_size: int = None) -> List[CodeFile]:
        """
        Perform the three-pass scoring algorithm from the paper.
        
        Args:
            files: List of code files to score
            sample_size: Size of initial sample (default: all files)
        
        Returns:
            List of files with updated scores
        """
        if sample_size is None:
            sample_size = len(files)
        
        # Sample files for scoring
        sample_files = random.sample(files, min(sample_size, len(files)))
        
        print(f"Starting three-pass scoring on {len(sample_files)} files...")
        
        # Pass 1: Random reference
        print("\nPass 1: Random reference")
        reference1 = random.choice(sample_files)
        scores1 = self.score_against_reference(sample_files, reference1)
        
        # Find highest scored file from Pass 1
        best_id1 = max(scores1, key=scores1.get)
        reference2 = next(f for f in sample_files if f.id == best_id1)
        
        # Pass 2: Best from Pass 1 as reference
        print(f"\nPass 2: Using {reference2.id} as reference")
        scores2 = self.score_against_reference(sample_files, reference2)
        
        # Find highest scored file from Pass 2
        best_id2 = max(scores2, key=scores2.get)
        reference3 = next(f for f in sample_files if f.id == best_id2)
        
        # Pass 3: Best from Pass 2 as reference
        print(f"\nPass 3: Using {reference3.id} as reference")
        scores3 = self.score_against_reference(sample_files, reference3)
        
        # Calculate final scores (average of Pass 2 and Pass 3)
        print("\nCalculating final scores...")
        for file in sample_files:
            # Paper: "averaged the scores from second and third passes"
            file.score = (scores2[file.id] + scores3[file.id]) / 2
        
        # Store results for analysis
        self.pass_results = [
            {"pass": 1, "scores": scores1, "reference": reference1.id},
            {"pass": 2, "scores": scores2, "reference": reference2.id},
            {"pass": 3, "scores": scores3, "reference": reference3.id}
        ]
        
        return sorted(sample_files, key=lambda x: x.score, reverse=True)

## 4. Demonstration with Mock Kotlin Code

Let's create a realistic demonstration using mock Kotlin code files.

In [None]:
# Create mock Kotlin code files with varying quality
mock_kotlin_files = [
    CodeFile(
        id="high_quality_1",
        content="""// Binary Search implementation with detailed explanation
class BinarySearch {
    /**
     * Performs binary search on a sorted array.
     * Time complexity: O(log n)
     * Space complexity: O(1)
     */
    fun search(arr: IntArray, target: Int): Int {
        var left = 0
        var right = arr.size - 1
        
        while (left <= right) {
            val mid = left + (right - left) / 2 // Prevents overflow
            
            when {
                arr[mid] == target -> return mid
                arr[mid] < target -> left = mid + 1
                else -> right = mid - 1
            }
        }
        
        return -1 // Element not found
    }
    
    // Recursive implementation for educational comparison
    fun searchRecursive(arr: IntArray, target: Int, left: Int = 0, right: Int = arr.size - 1): Int {
        if (left > right) return -1
        
        val mid = left + (right - left) / 2
        
        return when {
            arr[mid] == target -> mid
            arr[mid] < target -> searchRecursive(arr, target, mid + 1, right)
            else -> searchRecursive(arr, target, left, mid - 1)
        }
    }
}"""
    ),
    
    CodeFile(
        id="high_quality_2",
        content="""// Graph traversal algorithms with Kotlin idioms
class Graph(private val vertices: Int) {
    private val adjacencyList = Array(vertices) { mutableListOf<Int>() }
    
    fun addEdge(source: Int, destination: Int) {
        adjacencyList[source].add(destination)
        adjacencyList[destination].add(source) // For undirected graph
    }
    
    /**
     * Breadth-First Search using Kotlin collections
     */
    fun bfs(start: Int): List<Int> {
        val visited = BooleanArray(vertices)
        val queue = ArrayDeque<Int>()
        val result = mutableListOf<Int>()
        
        queue.add(start)
        visited[start] = true
        
        while (queue.isNotEmpty()) {
            val current = queue.removeFirst()
            result.add(current)
            
            adjacencyList[current].forEach { neighbor ->
                if (!visited[neighbor]) {
                    visited[neighbor] = true
                    queue.add(neighbor)
                }
            }
        }
        
        return result
    }
    
    /**
     * Depth-First Search (recursive)
     */
    fun dfs(current: Int, visited: BooleanArray = BooleanArray(vertices), 
                    result: MutableList<Int> = mutableListOf()): List<Int> {
        visited[current] = true
        result.add(current)
        
        adjacencyList[current]
            .filterNot { visited[it] }
            .forEach { dfs(it, visited, result) }
        
        return result
    }
}"""
    ),
    
    CodeFile(
        id="medium_quality_1",
        content="""// Simple sorting algorithm
fun bubbleSort(arr: IntArray) {
    val n = arr.size
    for (i in 0 until n) {
        for (j in 0 until n - i - 1) {
            if (arr[j] > arr[j + 1]) {
                // Swap elements
                val temp = arr[j]
                arr[j] = arr[j + 1]
                arr[j + 1] = temp
            }
        }
    }
}"""
    ),
    
    CodeFile(
        id="low_quality_1",
        content="""fun doSomething(x: Int): Int {
    return x * 2
}"""
    ),
    
    CodeFile(
        id="low_quality_2",
        content="""val list = listOf(1, 2, 3, 4, 5)
println(list)"""
    )
]

# Create classifier and scorer
classifier = PairwiseQualityClassifier(use_mock=True)
scorer = ThreePassScorer(classifier)

# Run the three-pass algorithm
scored_files = scorer.three_pass_scoring(mock_kotlin_files)

## 5. Analyzing the Results

Let's visualize how the three-pass algorithm converges on quality assessment.

In [None]:
# Display final rankings
print("\nFinal Quality Rankings:")
print("=" * 50)
for i, file in enumerate(scored_files, 1):
    print(f"{i}. {file.id}: Score = {file.score:.3f} (Comparisons: {file.comparisons})")
    print(f"   Preview: {file.content.split(chr(10))[0][:60]}...")
    print()

In [None]:
# Visualize score evolution across passes
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, pass_result in enumerate(scorer.pass_results):
    ax = axes[i]
    
    # Extract scores for each file
    file_ids = list(pass_result["scores"].keys())
    scores = list(pass_result["scores"].values())
    
    # Create bar plot
    bars = ax.bar(range(len(file_ids)), scores)
    
    # Color bars based on score
    colors = ['green' if s > 0.6 else 'orange' if s > 0.4 else 'red' for s in scores]
    for bar, color in zip(bars, colors):
        bar.set_color(color)
    
    ax.set_xlabel('Files')
    ax.set_ylabel('Score')
    ax.set_title(f'Pass {i+1} (Reference: {pass_result["reference"][:15]}...)')
    ax.set_xticks(range(len(file_ids)))
    ax.set_xticklabels([fid[:10] for fid in file_ids], rotation=45, ha='right')
    ax.set_ylim(0, 1)
    ax.axhline(y=0.5, color='black', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Key Insights and Learnings

### 6.1 Why This Algorithm Works

1. **Bidirectional Comparison**: Cancels ordering bias by testing both A vs B and B vs A
2. **Progressive Refinement**: Each pass uses a better reference than the last, which is intended to sharpen the ranking
3. **Efficiency**: About 3n bidirectional comparisons, i.e. O(n), instead of O(n²)

### 6.2 Critical Implementation Details from the Paper

In [None]:
# Demonstrate the importance of bidirectional scoring
print("Demonstrating Bidirectional Scoring:")
print("=" * 50)

# Test with two files
file1 = mock_kotlin_files[0]  # high_quality_1
file2 = mock_kotlin_files[2]  # medium_quality_1

# One-directional scores
result_12 = classifier.compare_files(file1.content, file2.content)
result_21 = classifier.compare_files(file2.content, file1.content)

print(f"File1 vs File2 (File1 as A): {result_12}")
print(f"File2 vs File1 (File2 as A): {result_21}")

# Bidirectional score
bidirectional = classifier.calculate_bidirectional_score(file1.content, file2.content)
print(f"\nBidirectional score for File1: {bidirectional:.3f}")
print("\nThis accounts for potential ordering bias in the LLM's responses.")

### 6.3 Practical Considerations

From Section III.B and our implementation:

In [None]:
# Calculate computational savings
n_files = 128000  # From the paper
n_sample = 128000  # They scored all 128K files

# Full pairwise comparisons
full_comparisons = n_files * (n_files - 1) // 2

# Three-pass comparisons (each bidirectional comparison costs two LLM calls)
three_pass_comparisons = 3 * n_sample

# Savings
savings_ratio = three_pass_comparisons / full_comparisons

print(f"Dataset size: {n_files:,} files")
print(f"\nFull pairwise comparisons: {full_comparisons:,}")
print(f"Three-pass comparisons: {three_pass_comparisons:,}")
print(f"\nComputational savings: {(1 - savings_ratio) * 100:.2f}%")
print(f"\nThis makes the approach feasible even for large datasets!")

## 7. Extension: Binary Classifier Training

The paper mentions training a binary classifier on the scored data. Let's simulate this.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Prepare data for binary classification
# Top 5% as positive (high quality), rest as negative
threshold_percentile = 95
threshold_score = np.percentile([f.score for f in scored_files], threshold_percentile)

# Create labels
X_texts = [f.content for f in scored_files]
y_labels = [1 if f.score >= threshold_score else 0 for f in scored_files]

print(f"Threshold score (top 5%): {threshold_score:.3f}")
print(f"High quality samples: {sum(y_labels)}")
print(f"Low quality samples: {len(y_labels) - sum(y_labels)}")

# Feature extraction (simplified - paper uses CodeT5+)
vectorizer = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_features = vectorizer.fit_transform(X_texts)

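# Train binary classifier
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_labels, test_size=0.3, random_state=42
)

# With only five mock files the training split can contain a single class,
# which LogisticRegression cannot fit, so guard against that case.
if len(set(y_train)) < 2:
    print("\nTraining split contains only one class; add more files to train the classifier.")
else:
    # Named quality_clf to avoid shadowing the pairwise `classifier` defined earlier
    quality_clf = LogisticRegression()
    quality_clf.fit(X_train, y_train)

    # Evaluate; labels=[0, 1] keeps the report valid if a class is absent from the test split
    y_pred = quality_clf.predict(X_test)
    print("\nClassifier Performance:")
    print(classification_report(
        y_test, y_pred, labels=[0, 1],
        target_names=['Low Quality', 'High Quality'], zero_division=0
    ))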

## 8. Replicating Figure 3: Mistral vs GPT-3.5 Classifier

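The paper reports that filtering with a Mistral-based classifier gives better results than filtering with a GPT-3.5-based one. The curves in this section are simulated from synthetic data to illustrate the shape of that result; they do not reproduce the paper's measurements.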
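
Why noisy log-probabilities hurt can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's analysis: the judge is modeled as a logistic function of a made-up latent quality gap, and API noise is modeled as Gaussian noise added to the logit of each call. The names `scores_against_reference` and `logprob_noise_sd` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spearman(x, y):
    """Spearman rank correlation for tie-free data, using NumPy only."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def scores_against_reference(quality, q_ref, logprob_noise_sd, rng):
    """Bidirectional scores when the judge's logits carry Gaussian noise (an assumption)."""
    n = quality.shape[0]
    # Each ordering is a separate call, so each gets its own noise draw.
    p_fc = sigmoid(quality - q_ref + rng.normal(0, logprob_noise_sd, n))  # f shown as A
    p_cf = sigmoid(q_ref - quality + rng.normal(0, logprob_noise_sd, n))  # c shown as A
    return ((2 * p_fc - 1) + (1 - 2 * p_cf)) / 2

rng = np.random.default_rng(0)
quality = rng.normal(0, 1, 500)   # latent quality of 500 hypothetical files
q_ref = 1.5                       # a strong reference file

corr_clean = spearman(quality, scores_against_reference(quality, q_ref, 0.0, rng))
corr_noisy = spearman(quality, scores_against_reference(quality, q_ref, 2.0, rng))
print(f"clean judge: {corr_clean:.3f}, noisy judge: {corr_noisy:.3f}")
```

Under these assumptions, the noisy judge's scores track the latent quality less closely, so a top-k filter built on them would admit more low-quality files.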

In [None]:
# Simulate training curves from Figure 3
optimization_steps = np.linspace(0, 1400, 50)

# Paper insight: "noise in the log-probabilities of the completion distribution in the OpenAI API"
# Mistral has more stable log-probabilities

# Simulate the curves
base_pass_rate = 26  # Starting pass rate

# GPT-3.5: Higher noise, lower final improvement
gpt35_noise = np.random.normal(0, 1.5, 50)  # Higher noise
gpt35_curve = base_pass_rate + 10 * (1 - np.exp(-optimization_steps / 400)) + gpt35_noise

# Mistral: Lower noise, better final improvement
mistral_noise = np.random.normal(0, 0.5, 50)  # Lower noise
mistral_curve = base_pass_rate + 14 * (1 - np.exp(-optimization_steps / 300)) + mistral_noise

plt.figure(figsize=(10, 6))
plt.plot(optimization_steps, gpt35_curve, 'b-', label='OpenAI GPT-3.5-based classifier', alpha=0.7)
plt.plot(optimization_steps, mistral_curve, 'orange', label='Mistral-based classifier', alpha=0.7)

# Add smoothed trends
from scipy.ndimage import gaussian_filter1d
plt.plot(optimization_steps, gaussian_filter1d(gpt35_curve, sigma=5), 'b--', linewidth=2)
plt.plot(optimization_steps, gaussian_filter1d(mistral_curve, sigma=5), 'orange', linewidth=2, linestyle='--')

plt.xlabel('Optimization step')
plt.ylabel('Pass rate')
plt.title('Simulated illustration of Figure 3: Pass rate on HumanEval for Different Filtration Strategies')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(25, 42)

# Add annotation
plt.annotate('Mistral: More stable\nlog-probabilities', 
             xy=(1000, 38), xytext=(700, 35),
             arrowprops=dict(arrowstyle='->', color='orange', alpha=0.5))
plt.annotate('GPT-3.5: Noisy\nlog-probabilities', 
             xy=(1000, 34), xytext=(700, 30),
             arrowprops=dict(arrowstyle='->', color='blue', alpha=0.5))

plt.show()

print("Key insight from the paper:")
print("The paper attributes GPT-3.5's weaker results to noise in the log-probabilities returned by the OpenAI API, which may be added as a defense against distillation.")
print("This noise reduces the effectiveness of the pairwise comparison method.")

## 9. Summary and Practical Applications

### What We Learned

1. **Quality > Quantity**: 25K carefully selected files outperform 4M random files
2. **Approximation**: Three passes replace the O(n²) all-pairs comparison with O(n) comparisons
3. **Bidirectional Scoring**: Essential for canceling LLM ordering bias
4. **Model Choice Matters**: Stable log-probabilities (Mistral) beat noisy ones (GPT-3.5)

### Practical Applications

This technique can be applied to:
- Curating training datasets for any programming language
- Quality assessment of code repositories
- Educational content filtering
- Code review prioritization

In [None]:
# Final summary statistics
print("Dataset Quality Filtering Summary:")
print("=" * 50)
print(f"Original dataset: 4,000,000 files")
print(f"After filtering: 25,000 files (0.625% retained)")
print(f"Pass rate improvement: +11.80 percentage points (26.09% → 37.89%)")
print(f"\nThis 160x reduction in data size led to significant quality improvements!")