# Focused Learning: Advanced Contamination Detection & Dataset Versioning

## Learning Objectives
1. Master advanced techniques for detecting data contamination in LLM evaluation
2. Understand the difference between memorization and true generalization
3. Implement statistical methods for quantifying contamination risk
4. Build robust dataset versioning systems for temporal benchmarks
5. Learn to detect subtle forms of information leakage

## Concept Source
- **Paper Section**: Section 3 (Contamination Analysis) - Extended from basic temporal analysis
- **Key Insight**: "The minimal temporal overlap between GPT-4o-0806's release date and our test problem release window strongly suggests authentic model capability measurements"
- **Research Gap**: The paper covers only basic temporal analysis; this notebook adds deeper contamination detection
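
The temporal argument in the key insight can be made concrete by measuring how many test problems were released on or before a model's release date. The following is a minimal sketch, not the paper's procedure: the function name `temporal_overlap_fraction`, the example dates, and the cutoff are all hypothetical and chosen only for illustration.

```python
from datetime import datetime
from typing import List


def temporal_overlap_fraction(release_dates: List[datetime], model_cutoff: datetime) -> float:
    """Fraction of problems released on or before the assumed model cutoff."""
    if not release_dates:
        return 0.0
    exposed = sum(1 for d in release_dates if d <= model_cutoff)
    return exposed / len(release_dates)


# Illustrative dates only; the cutoff is a placeholder, not a real model's date
example_dates = [
    datetime(2024, 7, 1),
    datetime(2024, 9, 15),
    datetime(2024, 11, 30),
    datetime(2025, 1, 10),
]
example_cutoff = datetime(2024, 8, 1)

overlap = temporal_overlap_fraction(example_dates, example_cutoff)
print(f"Share of problems at risk of exposure: {overlap:.2f}")
```

A small fraction supports the temporal argument, but it says nothing about near-duplicates of older problems, which is why the later sections look at similarity as well.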

## 1. The Multi-Dimensional Contamination Problem

### Beyond Simple Temporal Splits

The paper uses temporal splits as its primary defense against contamination, but real-world contamination takes several forms:

1. **Direct Memorization**: The model saw the exact problem during training
2. **Indirect Exposure**: The model saw similar problems or solution patterns
3. **Cross-Pollination**: Information leaked through related datasets
4. **Synthetic Contamination**: Training data was generated from test sets
5. **Human Contamination**: Evaluators are unconsciously biased by known solutions
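
The first two categories can be told apart with simple text signals. The sketch below is an illustration under stated assumptions rather than a complete detector: a hash of normalized text stands in for a direct (verbatim) match, and word n-gram Jaccard overlap stands in for indirect exposure. All function names and example strings are hypothetical.

```python
import hashlib
import re
from typing import Set, Tuple


def normalize_text(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial edits do not hide a match."""
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()


def exact_fingerprint(text: str) -> str:
    """Hash of the normalized text; equal fingerprints indicate a verbatim match."""
    return hashlib.sha256(normalize_text(text).encode()).hexdigest()


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams of the normalized text."""
    tokens = normalize_text(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams; a rough signal of indirect exposure."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


original = "Find the missing number in an arithmetic progression."
reformatted = "find the   missing number in an Arithmetic Progression"
paraphrase = "Return the missing number in an arithmetic sequence."

same_fingerprint = exact_fingerprint(original) == exact_fingerprint(reformatted)
paraphrase_overlap = ngram_jaccard(original, paraphrase)
print(f"Verbatim match after normalization: {same_fingerprint}")
print(f"Trigram overlap with paraphrase: {paraphrase_overlap:.2f}")
```

Fingerprints only catch copies that differ in case, spacing, or punctuation; the n-gram score degrades gracefully under paraphrase, which makes it a better fit for the indirect case.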

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional, Set
from dataclasses import dataclass
from datetime import datetime, timedelta
import hashlib
import json
from scipy import stats
from scipy.spatial.distance import cosine, euclidean
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
import re
import warnings
warnings.filterwarnings('ignore')

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

## 2. Problem Similarity Detection

### Detecting Near-Duplicate Problems Across Time

In [None]:
@dataclass
class Problem:
    """Enhanced problem representation for contamination analysis"""
    id: str
    title: str
    description: str
    solution_code: str
    test_cases: List[Dict]
    release_date: datetime
    difficulty: str
    tags: List[str]
    source_platform: str
    language: str = "python"

class ProblemSimilarityDetector:
    """Advanced similarity detection for identifying potential contamination"""
    
    def __init__(self):
        self.vectorizer = TfidfVectorizer(
            max_features=1000,
            stop_words='english',
            ngram_range=(1, 3),
            min_df=2
        )
        self.similarity_threshold = 0.8
        
    def extract_problem_features(self, problem: Problem) -> Dict:
        """Extract multi-dimensional features from problem"""
        features = {
            # Text features
            'description_hash': hashlib.md5(problem.description.encode()).hexdigest(),
            'description_length': len(problem.description),
            'word_count': len(problem.description.split()),
            
            # Algorithmic features
            'algorithm_keywords': self._extract_algorithm_keywords(problem.description),
            'constraint_signature': self._extract_constraint_signature(problem.description),
            'io_pattern': self._extract_io_pattern(problem.test_cases),
            
            # Code features
            'solution_complexity': self._analyze_solution_complexity(problem.solution_code),
            'code_patterns': self._extract_code_patterns(problem.solution_code),
            
            # Metadata
            'difficulty': problem.difficulty,
            'tags': set(problem.tags),
            'release_date': problem.release_date
        }
        
        return features
    
    def _extract_algorithm_keywords(self, description: str) -> Set[str]:
        """Extract algorithmic concepts from problem description"""
        algorithm_patterns = {
            'binary_search': r'\b(binary search|search|sorted|log)\b',
            'dynamic_programming': r'\b(dp|dynamic|optimal|subproblem|memoization)\b',
            'graph': r'\b(graph|node|edge|tree|connected|path|cycle)\b',
            'greedy': r'\b(greedy|optimal|maximize|minimize|best)\b',
            'two_pointer': r'\b(two pointer|left|right|pointer)\b',
            'sliding_window': r'\b(window|substring|subarray|contiguous)\b',
            'backtracking': r'\b(backtrack|permutation|combination|generate)\b'
        }
        
        keywords = set()
        text = description.lower()
        
        for keyword, pattern in algorithm_patterns.items():
            if re.search(pattern, text):
                keywords.add(keyword)
        
        return keywords
    
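    def _extract_constraint_signature(self, description: str) -> str:
        """Extract constraint patterns that might indicate similar problems"""
        # Match bounds such as "3 <= arr.length <= 1000" or "0 <= arr[i] <= 10^5";
        # variable names may contain dots and brackets, bounds may use "^" powers
        bound = r'(-?\d+(?:\^\d+)?)'
        constraint_pattern = bound + r'\s*<=?\s*([\w.\[\]]+)\s*<=?\s*' + bound
        constraints = re.findall(constraint_pattern, description)
        
        def parse_bound(token: str) -> int:
            """Convert a bound such as '1000', '10^5', or '-10^4' to an int"""
            sign = -1 if token.startswith('-') else 1
            base, _, exponent = token.lstrip('-').partition('^')
            return sign * (int(base) ** int(exponent) if exponent else int(base))
        
        # Normalize constraint ranges by the size of the upper bound
        normalized = []
        for min_val, var, max_val in constraints:
            try:
                max_int = parse_bound(max_val)
            except ValueError:
                continue
            range_type = 'small' if max_int < 100 else 'medium' if max_int < 10000 else 'large'
            normalized.append(f"{var}:{range_type}")
                
        return '|'.join(sorted(normalized))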
    
    def _extract_io_pattern(self, test_cases: List[Dict]) -> str:
        """Extract input/output pattern signature"""
        if not test_cases:
            return "no_tests"
            
        patterns = []
        for case in test_cases[:3]:  # Analyze first 3 test cases
            input_data = case.get('input', {})
            output_data = case.get('output')
            
            # Analyze input structure
            input_types = []
            for key, value in input_data.items():
                if isinstance(value, list):
                    input_types.append(f"list[{len(value)}]")
                elif isinstance(value, int):
                    input_types.append("int")
                elif isinstance(value, str):
                    input_types.append("str")
            
            # Analyze output type (bool is checked before int because
            # bool is a subclass of int in Python)
            if isinstance(output_data, list):
                output_type = f"list[{len(output_data)}]"
            elif isinstance(output_data, bool):
                output_type = "bool"
            elif isinstance(output_data, int):
                output_type = "int"
            else:
                output_type = "other"
            
            patterns.append(f"{'_'.join(input_types)}->{output_type}")
        
        return '|'.join(patterns)
    
    def _analyze_solution_complexity(self, code: str) -> Dict:
        """Analyze solution complexity patterns"""
        lines = [line.strip() for line in code.split('\n') if line.strip()]
        
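        # Heuristic: a function counts as recursive if its name is called
        # somewhere in addition to its own definition
        func_names = re.findall(r'def\s+(\w+)\s*\(', code)
        is_recursive = any(
            len(re.findall(rf'\b{re.escape(name)}\s*\(', code)) > 1
            for name in func_names
        )
        
        complexity_indicators = {
            # Total number of loop keywords (a proxy; nesting depth is not measured)
            'nested_loops': len(re.findall(r'\b(?:for|while)\b', code)),
            'recursion': is_recursive,
            'builtin_usage': sum(1 for builtin in ['sum(', 'max(', 'min(', 'sorted(', 'len('] if builtin in code),
            'data_structures': sum(1 for ds in ['dict(', 'set(', 'list(', 'defaultdict'] if ds in code),
            'line_count': len(lines)
        }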
        
        return complexity_indicators
    
    def _extract_code_patterns(self, code: str) -> Set[str]:
        """Extract common coding patterns"""
        patterns = set()
        
        pattern_signatures = {
            'two_pointer': r'(left.*right|i.*j.*while)',
            'sliding_window': r'(window|start.*end)',
            'binary_search': r'(left.*right.*mid|while.*left.*<=.*right)',
            'dp_array': r'(dp\[|memo\[)',
            'dfs_pattern': r'def.*dfs|def.*helper',
            'early_return': r'if.*return.*(?:True|False|-1|0)'
        }
        
        for pattern_name, regex in pattern_signatures.items():
            if re.search(regex, code, re.IGNORECASE):
                patterns.add(pattern_name)
        
        return patterns
    
    def calculate_similarity_matrix(self, problems: List[Problem]) -> np.ndarray:
        """Calculate comprehensive similarity matrix between problems"""
        n = len(problems)
        similarity_matrix = np.zeros((n, n))
        
        # Extract features for all problems
        features = [self.extract_problem_features(p) for p in problems]
        
        # Calculate text similarity using TF-IDF
        descriptions = [p.description for p in problems]
        tfidf_matrix = self.vectorizer.fit_transform(descriptions)
        text_similarity = cosine_similarity(tfidf_matrix)
        
        for i in range(n):
            for j in range(i, n):
                if i == j:
                    similarity_matrix[i][j] = 1.0
                    continue
                
                # Multi-dimensional similarity calculation
                similarities = {
                    'text': text_similarity[i][j],
                    'algorithm': self._algorithm_similarity(features[i], features[j]),
                    'constraint': self._constraint_similarity(features[i], features[j]),
                    'io_pattern': self._io_similarity(features[i], features[j]),
                    'code_pattern': self._code_similarity(features[i], features[j]),
                    'metadata': self._metadata_similarity(features[i], features[j])
                }
                
                # Weighted combination
                weights = {
                    'text': 0.3,
                    'algorithm': 0.25,
                    'constraint': 0.15,
                    'io_pattern': 0.15,
                    'code_pattern': 0.1,
                    'metadata': 0.05
                }
                
                total_similarity = sum(similarities[k] * weights[k] for k in similarities)
                similarity_matrix[i][j] = similarity_matrix[j][i] = total_similarity
        
        return similarity_matrix
    
    def _algorithm_similarity(self, feat1: Dict, feat2: Dict) -> float:
        """Calculate algorithmic concept similarity"""
        keywords1 = feat1.get('algorithm_keywords', set())
        keywords2 = feat2.get('algorithm_keywords', set())
        
        if not keywords1 and not keywords2:
            return 0.5  # Both have no clear algorithmic indicators
        
        intersection = len(keywords1 & keywords2)
        union = len(keywords1 | keywords2)
        
        return intersection / union if union > 0 else 0
    
    def _constraint_similarity(self, feat1: Dict, feat2: Dict) -> float:
        """Calculate constraint pattern similarity"""
        const1 = feat1.get('constraint_signature', '')
        const2 = feat2.get('constraint_signature', '')
        
        if const1 == const2:
            return 1.0
        
        # Partial match for constraint patterns
        parts1 = set(const1.split('|')) if const1 else set()
        parts2 = set(const2.split('|')) if const2 else set()
        
        if not parts1 and not parts2:
            return 0.5
        
        intersection = len(parts1 & parts2)
        union = len(parts1 | parts2)
        
        return intersection / union if union > 0 else 0
    
    def _io_similarity(self, feat1: Dict, feat2: Dict) -> float:
        """Calculate input/output pattern similarity"""
        pattern1 = feat1.get('io_pattern', '')
        pattern2 = feat2.get('io_pattern', '')
        
        if pattern1 == pattern2:
            return 1.0
        
        # Check for similar patterns
        parts1 = set(pattern1.split('|')) if pattern1 else set()
        parts2 = set(pattern2.split('|')) if pattern2 else set()
        
        if not parts1 and not parts2:
            return 0.5
        
        intersection = len(parts1 & parts2)
        union = len(parts1 | parts2)
        
        return intersection / union if union > 0 else 0
    
    def _code_similarity(self, feat1: Dict, feat2: Dict) -> float:
        """Calculate code pattern similarity"""
        patterns1 = feat1.get('code_patterns', set())
        patterns2 = feat2.get('code_patterns', set())
        
        if not patterns1 and not patterns2:
            return 0.5
        
        intersection = len(patterns1 & patterns2)
        union = len(patterns1 | patterns2)
        
        return intersection / union if union > 0 else 0
    
    def _metadata_similarity(self, feat1: Dict, feat2: Dict) -> float:
        """Calculate metadata similarity"""
        # Difficulty similarity
        diff_sim = 1.0 if feat1.get('difficulty') == feat2.get('difficulty') else 0.0
        
        # Tag similarity
        tags1 = feat1.get('tags', set())
        tags2 = feat2.get('tags', set())
        
        if tags1 and tags2:
            tag_sim = len(tags1 & tags2) / len(tags1 | tags2)
        else:
            tag_sim = 0.5
        
        return (diff_sim + tag_sim) / 2

# Test the similarity detector
def create_mock_problems() -> List[Problem]:
    """Create mock problems for testing similarity detection"""
    problems = []
    
    # Problem 1: Original missing number in AP
    problems.append(Problem(
        id="missing_ap_1",
        title="Missing Number in Arithmetic Progression",
        description="""Given an array representing an arithmetic progression with one missing element,
        find the missing number. Constraints: 3 <= arr.length <= 1000, 0 <= arr[i] <= 10^5""",
        solution_code="""def missingNumber(arr):
            n = len(arr)
            expected = (n + 1) * (arr[0] + arr[-1]) // 2
            return expected - sum(arr)""",
        test_cases=[{"input": {"arr": [5, 7, 11, 13]}, "output": 9}],
        release_date=datetime(2019, 10, 15),
        difficulty="Easy",
        tags=["Array", "Math"],
        source_platform="LeetCode"
    ))
    
    # Problem 2: Very similar problem (potential contamination)
    problems.append(Problem(
        id="missing_ap_2",
        title="Find Missing Element in Sequence",
        description="""You are given an arithmetic sequence with one element removed.
        Return the missing element. Constraints: 3 <= sequence.length <= 1000, 0 <= sequence[i] <= 10^5""",
        solution_code="""def findMissing(sequence):
            length = len(sequence)
            total_sum = (length + 1) * (sequence[0] + sequence[-1]) // 2
            return total_sum - sum(sequence)""",
        test_cases=[{"input": {"sequence": [2, 4, 8, 10]}, "output": 6}],
        release_date=datetime(2024, 8, 20),
        difficulty="Easy",
        tags=["Array", "Math"],
        source_platform="CodeForces"
    ))
    
    # Problem 3: Different problem (binary search)
    problems.append(Problem(
        id="binary_search_1",
        title="Search in Sorted Array",
        description="""Given a sorted array and target value, return the index if found.
        Constraints: 1 <= nums.length <= 10^4, -10^4 <= nums[i] <= 10^4""",
        solution_code="""def search(nums, target):
            left, right = 0, len(nums) - 1
            while left <= right:
                mid = (left + right) // 2
                if nums[mid] == target: return mid
                elif nums[mid] < target: left = mid + 1
                else: right = mid - 1
            return -1""",
        test_cases=[{"input": {"nums": [1, 3, 5, 7, 9], "target": 5}, "output": 2}],
        release_date=datetime(2020, 3, 10),
        difficulty="Easy",
        tags=["Array", "Binary Search"],
        source_platform="LeetCode"
    ))
    
    # Problem 4: Another AP problem with different approach
    problems.append(Problem(
        id="missing_ap_3",
        title="Arithmetic Progression Gap",
        description="""Find the gap in an arithmetic progression array.
        Constraints: 3 <= arr.length <= 500, -1000 <= arr[i] <= 1000""",
        solution_code="""def findGap(arr):
            n = len(arr)
            diff = (arr[-1] - arr[0]) // n
            for i in range(n - 1):
                if arr[i + 1] - arr[i] != diff:
                    return arr[i] + diff
            return arr[-1] + diff""",
        test_cases=[{"input": {"arr": [1, 3, 7, 9]}, "output": 5}],
        release_date=datetime(2024, 12, 1),
        difficulty="Medium",
        tags=["Array", "Math"],
        source_platform="AtCoder"
    ))
    
    return problems

# Test similarity detection
detector = ProblemSimilarityDetector()
mock_problems = create_mock_problems()

print("Problem Similarity Analysis:")
print("===========================\n")

# Extract features
for i, problem in enumerate(mock_problems):
    features = detector.extract_problem_features(problem)
    print(f"Problem {i+1}: {problem.title}")
    print(f"  Algorithm Keywords: {features['algorithm_keywords']}")
    print(f"  Constraint Signature: {features['constraint_signature']}")
    print(f"  IO Pattern: {features['io_pattern']}")
    print(f"  Code Patterns: {features['code_patterns']}")
    print()

# Calculate similarity matrix
similarity_matrix = detector.calculate_similarity_matrix(mock_problems)

print("Similarity Matrix:")
print("==================")
problem_names = [p.title[:20] + "..." for p in mock_problems]
df = pd.DataFrame(similarity_matrix, index=problem_names, columns=problem_names)
print(df.round(3))
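
A similarity matrix becomes a contamination signal once it is combined with release dates: a new problem that closely resembles an older one may be solvable from memory even if the new text was never seen. The sketch below assumes a precomputed matrix and a hypothetical training cutoff. The function name `flag_cross_cutoff_pairs` and the toy scores are illustrative, and the default threshold of 0.8 mirrors `similarity_threshold` in the detector.

```python
from datetime import datetime
from typing import List, Tuple

import numpy as np


def flag_cross_cutoff_pairs(
    similarity: np.ndarray,
    release_dates: List[datetime],
    cutoff: datetime,
    threshold: float = 0.8,
) -> List[Tuple[int, int, float]]:
    """Return (old_index, new_index, score) for similar pairs that straddle the cutoff."""
    flagged = []
    n = len(release_dates)
    for i in range(n):
        for j in range(i + 1, n):
            # Exactly one of the two problems was released on or before the cutoff
            straddles = (release_dates[i] <= cutoff) != (release_dates[j] <= cutoff)
            if straddles and similarity[i, j] >= threshold:
                old, new = (i, j) if release_dates[i] <= cutoff else (j, i)
                flagged.append((old, new, float(similarity[i, j])))
    return flagged


# Toy 3x3 matrix with made-up scores, not output from the detector
toy_similarity = np.array([
    [1.00, 0.85, 0.20],
    [0.85, 1.00, 0.30],
    [0.20, 0.30, 1.00],
])
toy_dates = [datetime(2019, 10, 15), datetime(2024, 8, 20), datetime(2020, 3, 10)]
toy_cutoff = datetime(2023, 12, 31)  # hypothetical training cutoff

pairs = flag_cross_cutoff_pairs(toy_similarity, toy_dates, toy_cutoff)
print(pairs)
```

Pairs that fall entirely before or entirely after the cutoff are ignored on purpose, since only old-to-new resemblance bears on whether a test problem could have leaked.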

## 3. Statistical Contamination Detection

### Advanced Statistical Methods for Detecting Memorization

In [None]:
class StatisticalContaminationDetector:
    """Advanced statistical methods for contamination detection"""
    
    def __init__(self):
        self.confidence_level = 0.95
        self.effect_size_threshold = 0.5  # Cohen's d
        
    def analyze_performance_distribution(self, results: pd.DataFrame) -> Dict:
        """Analyze performance distribution for contamination signals"""
        analysis = {}
        
        for model in results['model'].unique():
            model_data = results[results['model'] == model]
            
            # Basic statistics
            scores = model_data['score']
            analysis[model] = {
                'mean': scores.mean(),
                'std': scores.std(),
                'skewness': stats.skew(scores),
                'kurtosis': stats.kurtosis(scores),
                'outlier_count': self._count_outliers(scores),
                'distribution_test': self._test_normality(scores)
            }
            
            # Contamination indicators
            analysis[model]['contamination_indicators'] = {
                'high_outliers': (scores > scores.mean() + 2*scores.std()).sum(),
                'perfect_scores': (scores == 100).sum(),
                'score_clustering': self._detect_score_clustering(scores),
                'bimodal_distribution': self._test_bimodality(scores)
            }
        
        return analysis
    
    def _count_outliers(self, scores: pd.Series) -> int:
        """Count outliers using IQR method"""
        Q1 = scores.quantile(0.25)
        Q3 = scores.quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        return ((scores < lower_bound) | (scores > upper_bound)).sum()
    
    def _test_normality(self, scores: pd.Series) -> Dict:
        """Test if score distribution is normal"""
        if len(scores) < 8:
            return {'test': 'insufficient_data', 'p_value': None, 'is_normal': None}
        
        # Shapiro-Wilk test for normality
        statistic, p_value = stats.shapiro(scores)
        
        return {
            'test': 'shapiro_wilk',
            'statistic': statistic,
            'p_value': p_value,
            'is_normal': p_value > 0.05
        }
    
    def _detect_score_clustering(self, scores: pd.Series) -> Dict:
        """Detect unusual clustering in scores"""
        if len(scores) < 5:
            return {'clusters_detected': False, 'num_clusters': 0}
        
        # Use DBSCAN to detect clusters
        scores_array = scores.values.reshape(-1, 1)
        
        # Try different epsilon values
        best_clustering = None
        best_score = -1
        
        # Imported locally so this method does not depend on the notebook's import cell
        from sklearn.metrics import silhouette_score
        
        for eps in [1, 2, 5, 10]:
            clustering = DBSCAN(eps=eps, min_samples=2).fit(scores_array)
            n_clusters = len(set(clustering.labels_)) - (1 if -1 in clustering.labels_ else 0)
            
            # Keep only clusterings with several clusters that are not mostly tiny groups
            if 1 < n_clusters < len(scores) / 2:
                try:
                    # Note: DBSCAN noise points (label -1) are scored as their own group
                    score = silhouette_score(scores_array, clustering.labels_)
                except ValueError:
                    # silhouette_score rejects degenerate labelings
                    continue
                if score > best_score:
                    best_score = score
                    best_clustering = clustering
        
        if best_clustering is not None:
            return {
                'clusters_detected': True,
                'num_clusters': len(set(best_clustering.labels_)) - (1 if -1 in best_clustering.labels_ else 0),
                'silhouette_score': best_score
            }
        
        return {'clusters_detected': False, 'num_clusters': 0}
    
    def _test_bimodality(self, scores: pd.Series) -> Dict:
        """Test for bimodal distribution (sign of contamination)"""
        if len(scores) < 10:
            return {'bimodal': False, 'confidence': 0}
        
        # Histogram peak heuristic, a rough stand-in for a formal test such as
        # Hartigan's dip test; peaks in the first or last bin are not detected
        hist, bin_edges = np.histogram(scores, bins=min(10, len(scores)//2))
        
        # Look for two distinct peaks
        peaks = []
        for i in range(1, len(hist)-1):
            if hist[i] > hist[i-1] and hist[i] > hist[i+1]:
                peaks.append(i)
        
        # Check if peaks are separated by a valley
        bimodal = False
        if len(peaks) >= 2:
            # Find minimum between peaks
            peak1, peak2 = peaks[0], peaks[-1]
            valley_min = min(hist[peak1+1:peak2])
            peak_min = min(hist[peak1], hist[peak2])
            
            # Bimodal if valley is significantly lower than peaks
            if valley_min < peak_min * 0.5:
                bimodal = True
        
        return {
            'bimodal': bimodal,
            'num_peaks': len(peaks),
            'confidence': 0.8 if bimodal else 0.2
        }
    
    def temporal_performance_analysis(self, results: pd.DataFrame) -> Dict:
        """Analyze performance changes over time to detect contamination"""
        analysis = {}
        
        for model in results['model'].unique():
            model_data = results[results['model'] == model].copy()
            model_data = model_data.sort_values('release_date')
            
            # Convert dates to numeric for correlation
            dates_numeric = pd.to_numeric(model_data['release_date'])
            scores = model_data['score']
            
            # Calculate temporal correlation
            correlation, p_value = stats.pearsonr(dates_numeric, scores)
            
            # Detect change points
            change_points = self._detect_change_points(scores)
            
            # Calculate trend
            if len(scores) >= 3:
                slope, intercept, r_value, p_value_trend, std_err = stats.linregress(
                    range(len(scores)), scores
                )
            else:
                slope = p_value_trend = 0
                r_value = 0
            
            analysis[model] = {
                'temporal_correlation': correlation,
                'correlation_p_value': p_value,
                'trend_slope': slope,
                'trend_r_squared': r_value**2,
                'trend_p_value': p_value_trend,
                'change_points': change_points,
                'contamination_risk': self._calculate_contamination_risk(
                    correlation, slope, change_points
                )
            }
        
        return analysis
    
    def _detect_change_points(self, scores: pd.Series) -> List[int]:
        """Detect significant change points in performance"""
        if len(scores) < 6:
            return []
        
        change_points = []
        window_size = max(3, len(scores) // 4)
        
        for i in range(window_size, len(scores) - window_size):
            before = scores.iloc[:i]
            after = scores.iloc[i:]
            
            # T-test for significant difference
            if len(before) >= 2 and len(after) >= 2:
                t_stat, p_value = stats.ttest_ind(before, after)
                
                # Significant change if p < 0.05 and effect size > threshold
                if p_value < 0.05:
                    effect_size = abs(before.mean() - after.mean()) / np.sqrt(
                        ((len(before)-1)*before.var() + (len(after)-1)*after.var()) / 
                        (len(before) + len(after) - 2)
                    )
                    
                    if effect_size > self.effect_size_threshold:
                        change_points.append(i)
        
        return change_points
    
    def _calculate_contamination_risk(self, correlation: float, 
                                    slope: float, 
                                    change_points: List[int]) -> Dict:
        """Calculate overall contamination risk score"""
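        # Scores that fall as problems get newer suggest older problems were seen in training
        risk_factors = {
            'negative_temporal_correlation': max(0, -correlation) * 0.4,
            # Slope is in score points per step and is unbounded, so clamp it to [0, 1]
            'negative_trend': min(1.0, max(0, -slope)) * 0.3,
            'change_points': min(1.0, len(change_points) / 3) * 0.3
        }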
        
        total_risk = sum(risk_factors.values())
        
        # Risk classification
        if total_risk > 0.7:
            risk_level = "HIGH"
        elif total_risk > 0.4:
            risk_level = "MEDIUM"
        elif total_risk > 0.2:
            risk_level = "LOW"
        else:
            risk_level = "MINIMAL"
        
        return {
            'total_score': total_risk,
            'risk_level': risk_level,
            'factors': risk_factors
        }

# Create mock performance data for testing
def create_mock_performance_data() -> pd.DataFrame:
    """Create mock performance data with contamination signals"""
    np.random.seed(42)
    data = []
    
    # Clean model (consistent performance)
    clean_dates = pd.date_range('2024-07-01', '2024-12-31', freq='W')
    clean_scores = np.random.normal(65, 8, len(clean_dates))  # Consistent performance
    
    for date, score in zip(clean_dates, clean_scores):
        data.append({
            'model': 'CleanModel',
            'release_date': date,
            'score': max(0, min(100, score))
        })
    
    # Contaminated model (declining performance over time)
    cont_dates = pd.date_range('2024-07-01', '2024-12-31', freq='W')
    # High performance initially, declining over time
    cont_base = 85 - np.linspace(0, 25, len(cont_dates))  # Declining trend
    cont_scores = cont_base + np.random.normal(0, 5, len(cont_dates))
    
    for date, score in zip(cont_dates, cont_scores):
        data.append({
            'model': 'ContaminatedModel',
            'release_date': date,
            'score': max(0, min(100, score))
        })
    
    # Suspicious model (bimodal distribution)
    susp_dates = pd.date_range('2024-07-01', '2024-12-31', freq='W')
    # Mix of high and low scores (memorized vs. new problems)
    susp_scores = []
    for i in range(len(susp_dates)):
        if np.random.random() < 0.6:  # 60% high scores (memorized)
            score = np.random.normal(85, 5)
        else:  # 40% low scores (new problems)
            score = np.random.normal(35, 8)
        susp_scores.append(max(0, min(100, score)))
    
    for date, score in zip(susp_dates, susp_scores):
        data.append({
            'model': 'SuspiciousModel',
            'release_date': date,
            'score': score
        })
    
    return pd.DataFrame(data)

# Test statistical contamination detection
stat_detector = StatisticalContaminationDetector()
mock_performance = create_mock_performance_data()

print("Statistical Contamination Analysis:")
print("==================================\n")

# Analyze performance distributions
distribution_analysis = stat_detector.analyze_performance_distribution(mock_performance)

for model, analysis in distribution_analysis.items():
    print(f"Model: {model}")
    print(f"  Mean Score: {analysis['mean']:.2f} ± {analysis['std']:.2f}")
    print(f"  Skewness: {analysis['skewness']:.3f}")
    print(f"  Outliers: {analysis['outlier_count']}")
    print(f"  Perfect Scores: {analysis['contamination_indicators']['perfect_scores']}")
    print(f"  Bimodal: {analysis['contamination_indicators']['bimodal_distribution']['bimodal']}")
    print()

# Temporal analysis
temporal_analysis = stat_detector.temporal_performance_analysis(mock_performance)

print("Temporal Contamination Analysis:")
print("===============================\n")

for model, analysis in temporal_analysis.items():
    print(f"Model: {model}")
    print(f"  Temporal Correlation: {analysis['temporal_correlation']:.3f}")
    print(f"  Trend Slope: {analysis['trend_slope']:.3f}")
    print(f"  Change Points: {len(analysis['change_points'])}")
    print(f"  Contamination Risk: {analysis['contamination_risk']['risk_level']} ({analysis['contamination_risk']['total_score']:.3f})")
    print()
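
The t-test used for change points assumes roughly normal scores, which bimodal or clipped score distributions can violate. A permutation test is one distribution-free alternative for asking whether pre-cutoff scores are higher than post-cutoff scores. The sketch below is a swapped-in technique, not part of the detector class; the function name `permutation_test_mean_diff` and the synthetic scores are assumptions made for illustration.

```python
import numpy as np


def permutation_test_mean_diff(before, after, n_permutations: int = 5000, seed: int = 0):
    """One-sided permutation test for mean(before) > mean(after)."""
    rng = np.random.default_rng(seed)
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    observed = before.mean() - after.mean()

    pooled = np.concatenate([before, after])
    count = 0
    for _ in range(n_permutations):
        # Reassign scores to the two groups at random and recompute the gap
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(before)].mean() - shuffled[len(before):].mean()
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic scores for illustration: higher on pre-cutoff problems
rng = np.random.default_rng(42)
pre_cutoff_scores = rng.normal(85, 5, size=20)
post_cutoff_scores = rng.normal(60, 5, size=20)

gap, p = permutation_test_mean_diff(pre_cutoff_scores, post_cutoff_scores)
print(f"Mean gap: {gap:.1f} points, one-sided p-value: {p:.4f}")
```

A small p-value indicates that the gap is unlikely under random group assignment. It does not by itself establish contamination, because newer problems may simply be harder.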

## 4. Dataset Versioning and Provenance Tracking

### Building Robust Dataset Management Systems
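
One building block of such a system is a dataset checksum that is reproducible regardless of field order or problem order, so that two versions compare equal exactly when their contents match. The sketch below is one possible approach under stated assumptions: problems are plain dictionaries, and the helper names `problem_checksum` and `dataset_checksum` are hypothetical rather than part of the version control class.

```python
import hashlib
import json
from typing import Dict, List


def problem_checksum(problem: Dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(problem, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_checksum(problems: List[Dict]) -> str:
    """Order-independent checksum: hash the sorted per-problem checksums."""
    digest = hashlib.sha256()
    for item_hash in sorted(problem_checksum(p) for p in problems):
        digest.update(item_hash.encode("ascii"))
    return digest.hexdigest()


v1 = [
    {"id": "p1", "title": "A", "description": "first"},
    {"id": "p2", "title": "B", "description": "second"},
]
# Same content with reversed key order and reversed problem order
v1_reordered = [dict(reversed(list(p.items()))) for p in reversed(v1)]
# A new version with one added problem
v2 = v1 + [{"id": "p3", "title": "C", "description": "third"}]

print(dataset_checksum(v1) == dataset_checksum(v1_reordered))
print(dataset_checksum(v1) == dataset_checksum(v2))
```

Hashing per-problem checksums rather than raw concatenated text also avoids ambiguity at field boundaries, and it lets a version diff report which individual problems changed.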

In [None]:
import uuid
from typing import Optional, Union
from enum import Enum
import sqlite3
import pickle
import gzip

class ChangeType(Enum):
    """Types of changes in dataset"""
    ADDITION = "addition"
    MODIFICATION = "modification"
    REMOVAL = "removal"
    SPLIT_UPDATE = "split_update"
    METADATA_UPDATE = "metadata_update"

@dataclass
class DatasetChange:
    """Record of a change to the dataset"""
    change_id: str
    timestamp: datetime
    change_type: ChangeType
    affected_items: List[str]  # Problem IDs
    description: str
    author: str
    checksum_before: Optional[str]
    checksum_after: str
    metadata: Dict

@dataclass 
class DatasetVersion:
    """Complete dataset version"""
    version_id: str
    version_number: str
    creation_date: datetime
    description: str
    total_problems: int
    train_count: int
    test_count: int
    checksum: str
    parent_version: Optional[str]
    changes_since_parent: List[str]  # Change IDs
    contamination_score: float
    quality_metrics: Dict

class DatasetVersionControl:
    """Complete version control system for temporal datasets"""
    
    def __init__(self, db_path: str = ":memory:"):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)
        self.similarity_detector = ProblemSimilarityDetector()
        self.contamination_detector = StatisticalContaminationDetector()
        self._init_database()
        
    def _init_database(self):
        """Initialize database schema"""
        self.conn.executescript("""
        CREATE TABLE IF NOT EXISTS problems (
            id TEXT PRIMARY KEY,
            title TEXT,
            description TEXT,
            solution_code TEXT,
            test_cases BLOB,
            release_date TEXT,
            difficulty TEXT,
            tags TEXT,
            source_platform TEXT,
            checksum TEXT,
            created_at TEXT,
            updated_at TEXT
        );
        
        CREATE TABLE IF NOT EXISTS dataset_versions (
            version_id TEXT PRIMARY KEY,
            version_number TEXT,
            creation_date TEXT,
            description TEXT,
            total_problems INTEGER,
            train_count INTEGER,
            test_count INTEGER,
            checksum TEXT,
            parent_version TEXT,
            contamination_score REAL,
            quality_metrics BLOB
        );
        
        CREATE TABLE IF NOT EXISTS dataset_changes (
            change_id TEXT PRIMARY KEY,
            timestamp TEXT,
            change_type TEXT,
            affected_items BLOB,
            description TEXT,
            author TEXT,
            checksum_before TEXT,
            checksum_after TEXT,
            metadata BLOB
        );
        
        CREATE TABLE IF NOT EXISTS version_changes (
            version_id TEXT,
            change_id TEXT,
            PRIMARY KEY (version_id, change_id)
        );
        
        CREATE TABLE IF NOT EXISTS similarity_cache (
            problem1_id TEXT,
            problem2_id TEXT,
            similarity_score REAL,
            computed_at TEXT,
            PRIMARY KEY (problem1_id, problem2_id)
        );
        """)
        self.conn.commit()
    
    def add_problem(self, problem: Problem, author: str = "system") -> str:
        """Add a new problem and track the change"""
        # Calculate checksum
        problem_data = f"{problem.title}{problem.description}{problem.solution_code}"
        checksum = hashlib.sha256(problem_data.encode()).hexdigest()
        
        # Check for duplicates
        existing = self._find_similar_problems(problem)
        if existing:
            print(f"Warning: Similar problems found: {[p['id'] for p in existing]}")
        
        # Capture the dataset checksum before the insert so the change record
        # documents both the prior and the resulting state
        checksum_before = self._calculate_dataset_checksum()
        
        # Insert problem
        now = datetime.now().isoformat()
        test_cases_blob = gzip.compress(pickle.dumps(problem.test_cases))
        
        self.conn.execute("""
        INSERT INTO problems 
        (id, title, description, solution_code, test_cases, release_date, 
         difficulty, tags, source_platform, checksum, created_at, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            problem.id, problem.title, problem.description, problem.solution_code,
            test_cases_blob, problem.release_date.isoformat(),
            problem.difficulty, json.dumps(problem.tags), problem.source_platform,
            checksum, now, now
        ))
        
        # Record change
        change = DatasetChange(
            change_id=str(uuid.uuid4()),
            timestamp=datetime.now(),
            change_type=ChangeType.ADDITION,
            affected_items=[problem.id],
            description=f"Added problem: {problem.title}",
            author=author,
            checksum_before=checksum_before,
            checksum_after=self._calculate_dataset_checksum(),
            metadata={'similarity_warnings': len(existing)}
        )
        
        self._record_change(change)
        self.conn.commit()
        
        return change.change_id
    
    def _find_similar_problems(self, problem: Problem, threshold: float = 0.8) -> List[Dict]:
        """Find similar problems in database"""
        # Get all existing problems
        cursor = self.conn.execute(
            "SELECT id, title, description, solution_code, release_date, difficulty, tags FROM problems"
        )
        
        similar_problems = []
        for row in cursor.fetchall():
            existing = Problem(
                id=row[0], title=row[1], description=row[2], solution_code=row[3],
                test_cases=[], release_date=datetime.fromisoformat(row[4]),
                difficulty=row[5], tags=json.loads(row[6]),
                source_platform="unknown"
            )
            
            # Calculate similarity
            similarity = self._calculate_problem_similarity(problem, existing)
            
            if similarity > threshold:
                similar_problems.append({
                    'id': existing.id,
                    'title': existing.title,
                    'similarity': similarity
                })
        
        return similar_problems
    
    def _calculate_problem_similarity(self, problem1: Problem, problem2: Problem) -> float:
        """Calculate similarity between two problems"""
        # Use cached similarity if available
        cache_key = tuple(sorted([problem1.id, problem2.id]))
        
        cursor = self.conn.execute(
            "SELECT similarity_score FROM similarity_cache WHERE problem1_id=? AND problem2_id=?",
            cache_key
        )
        cached = cursor.fetchone()
        
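        if cached is not None:
            return cached[0]
        
        # Calculate similarity
        similarity_matrix = self.similarity_detector.calculate_similarity_matrix([problem1, problem2])
        # Cast to a built-in float so sqlite3 can bind the value regardless of NumPy dtype
        similarity = float(similarity_matrix[0, 1])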
        
        # Cache result
        self.conn.execute(
            "INSERT OR REPLACE INTO similarity_cache VALUES (?, ?, ?, ?)",
            (cache_key[0], cache_key[1], similarity, datetime.now().isoformat())
        )
        
        return similarity
    
    def create_version(self, version_number: str, description: str, 
                      cutoff_date: Optional[datetime] = None,
                      author: str = "system") -> str:
        """Create a new dataset version with temporal split"""
        version_id = str(uuid.uuid4())
        
        # Get current dataset state
        cursor = self.conn.execute("SELECT COUNT(*) FROM problems")
        total_problems = cursor.fetchone()[0]
        
        # Calculate train/test split
        if cutoff_date:
            cursor = self.conn.execute(
                "SELECT COUNT(*) FROM problems WHERE release_date < ?",
                (cutoff_date.isoformat(),)
            )
            train_count = cursor.fetchone()[0]
            test_count = total_problems - train_count
        else:
            # Default 80/20 split
            train_count = int(total_problems * 0.8)
            test_count = total_problems - train_count
        
        # Calculate quality metrics
        quality_metrics = self._calculate_version_quality_metrics(cutoff_date)
        
        # Calculate contamination score
        contamination_score = self._calculate_version_contamination_score()
        
        # Get parent version
        cursor = self.conn.execute(
            "SELECT version_id FROM dataset_versions ORDER BY creation_date DESC LIMIT 1"
        )
        parent = cursor.fetchone()
        parent_version = parent[0] if parent else None
        
        # Calculate dataset checksum
        checksum = self._calculate_dataset_checksum()
        
        # Create version
        version = DatasetVersion(
            version_id=version_id,
            version_number=version_number,
            creation_date=datetime.now(),
            description=description,
            total_problems=total_problems,
            train_count=train_count,
            test_count=test_count,
            checksum=checksum,
            parent_version=parent_version,
            changes_since_parent=[],
            contamination_score=contamination_score,
            quality_metrics=quality_metrics
        )
        
        # Insert version
        quality_blob = gzip.compress(pickle.dumps(quality_metrics))
        
        self.conn.execute("""
        INSERT INTO dataset_versions
        (version_id, version_number, creation_date, description, total_problems,
         train_count, test_count, checksum, parent_version, contamination_score, quality_metrics)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            version.version_id, version.version_number, version.creation_date.isoformat(),
            version.description, version.total_problems, version.train_count,
            version.test_count, version.checksum, version.parent_version,
            version.contamination_score, quality_blob
        ))
        
        self.conn.commit()
        return version_id
    
    def _calculate_dataset_checksum(self) -> str:
        """Calculate checksum of entire dataset"""
        cursor = self.conn.execute(
            "SELECT checksum FROM problems ORDER BY id"
        )
        checksums = [row[0] for row in cursor.fetchall()]
        combined = ''.join(checksums)
        return hashlib.sha256(combined.encode()).hexdigest()
    
    def _calculate_version_quality_metrics(self, cutoff_date: Optional[datetime]) -> Dict:
        """Calculate quality metrics for version"""
        metrics = {
            'duplicate_pairs': 0,
            'high_similarity_pairs': 0,
            'temporal_coverage_days': 0,
            'difficulty_distribution': {},
            'tag_diversity': 0
        }
        
        # Get all problems
        cursor = self.conn.execute(
            "SELECT id, release_date, difficulty, tags FROM problems"
        )
        problems = cursor.fetchall()
        
        if not problems:
            return metrics
        
        # Calculate temporal coverage
        dates = [datetime.fromisoformat(p[1]) for p in problems]
        metrics['temporal_coverage_days'] = (max(dates) - min(dates)).days
        
        # Difficulty distribution
        difficulties = [p[2] for p in problems]
        metrics['difficulty_distribution'] = dict(pd.Series(difficulties).value_counts())
        
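        # Tag diversity
        all_tags = set()
        for p in problems:
            tags = json.loads(p[3])
            all_tags.update(tags)
        metrics['tag_diversity'] = len(all_tags)
        
        # Pair counts from the similarity cache (filled as problems are added);
        # 0.95 and 0.8 match the thresholds used in the risk audit
        cursor = self.conn.execute(
            "SELECT COUNT(*) FROM similarity_cache WHERE similarity_score > 0.95"
        )
        metrics['duplicate_pairs'] = cursor.fetchone()[0]
        cursor = self.conn.execute(
            "SELECT COUNT(*) FROM similarity_cache WHERE similarity_score > 0.8"
        )
        metrics['high_similarity_pairs'] = cursor.fetchone()[0]
        
        return metrics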
    
    def _calculate_version_contamination_score(self) -> float:
        """Calculate contamination risk score for version"""
        # Summarize the cached pairwise similarity scores
        cursor = self.conn.execute(
            "SELECT AVG(similarity_score) FROM similarity_cache WHERE similarity_score > 0.8"
        )
        high_sim_avg = cursor.fetchone()[0] or 0
        
        cursor = self.conn.execute(
            "SELECT COUNT(*) FROM similarity_cache WHERE similarity_score > 0.9"
        )
        very_high_sim_count = cursor.fetchone()[0] or 0
        
        # Simple contamination score (0-1)
        contamination_score = min(1.0, (high_sim_avg * 0.5) + (very_high_sim_count * 0.1))
        
        return contamination_score
    
    def _record_change(self, change: DatasetChange):
        """Record a change in the database"""
        affected_blob = gzip.compress(pickle.dumps(change.affected_items))
        metadata_blob = gzip.compress(pickle.dumps(change.metadata))
        
        self.conn.execute("""
        INSERT INTO dataset_changes
        (change_id, timestamp, change_type, affected_items, description,
         author, checksum_before, checksum_after, metadata)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            change.change_id, change.timestamp.isoformat(), change.change_type.value,
            affected_blob, change.description, change.author,
            change.checksum_before, change.checksum_after, metadata_blob
        ))
    
    def get_version_history(self) -> pd.DataFrame:
        """Get version history with quality metrics"""
        cursor = self.conn.execute("""
        SELECT version_number, creation_date, description, total_problems,
               train_count, test_count, contamination_score
        FROM dataset_versions
        ORDER BY creation_date
        """)
        
        columns = ['Version', 'Date', 'Description', 'Total', 'Train', 'Test', 'Contamination']
        return pd.DataFrame(cursor.fetchall(), columns=columns)
    
    def audit_contamination_risk(self, version_id: Optional[str] = None) -> Dict:
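        """Comprehensive contamination risk audit.
        
        Note: version_id is currently unused; the audit covers every problem
        and cached similarity pair in the database.
        """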
        # Get high similarity pairs
        cursor = self.conn.execute("""
        SELECT problem1_id, problem2_id, similarity_score
        FROM similarity_cache
        WHERE similarity_score > 0.8
        ORDER BY similarity_score DESC
        """)
        
        high_sim_pairs = cursor.fetchall()
        
        # Get temporal distribution
        cursor = self.conn.execute(
            "SELECT release_date, COUNT(*) FROM problems GROUP BY release_date ORDER BY release_date"
        )
        temporal_dist = cursor.fetchall()
        
        audit_result = {
            'high_similarity_pairs': len(high_sim_pairs),
            'max_similarity': max([pair[2] for pair in high_sim_pairs]) if high_sim_pairs else 0,
            'temporal_gaps': self._detect_temporal_gaps(temporal_dist),
            'risk_recommendations': self._generate_risk_recommendations(high_sim_pairs)
        }
        
        return audit_result
    
    def _detect_temporal_gaps(self, temporal_dist: List[Tuple]) -> List[Dict]:
        """Detect suspicious gaps in temporal distribution"""
        if len(temporal_dist) < 2:
            return []
        
        gaps = []
        for i in range(len(temporal_dist) - 1):
            date1 = datetime.fromisoformat(temporal_dist[i][0])
            date2 = datetime.fromisoformat(temporal_dist[i+1][0])
            gap_days = (date2 - date1).days
            
            if gap_days > 30:  # Gap larger than 30 days
                gaps.append({
                    'start_date': date1.isoformat(),
                    'end_date': date2.isoformat(),
                    'gap_days': gap_days
                })
        
        return gaps
    
    def _generate_risk_recommendations(self, high_sim_pairs: List[Tuple]) -> List[str]:
        """Generate recommendations based on contamination risk"""
        recommendations = []
        
        if len(high_sim_pairs) > 10:
            recommendations.append("HIGH RISK: Many similar problems detected. Consider manual review.")
        
        very_high_sim = [p for p in high_sim_pairs if p[2] > 0.95]
        if very_high_sim:
            recommendations.append(f"CRITICAL: {len(very_high_sim)} near-duplicate problems found.")
        
        if len(high_sim_pairs) > 0:
            recommendations.append("Review similar problems for potential contamination.")
            recommendations.append("Consider increasing temporal split buffer.")
        
        return recommendations

# Test the version control system
print("Dataset Version Control System Demo:")
print("===================================\n")

# Initialize version control
vcs = DatasetVersionControl()

# Add some problems
problems = create_mock_problems()
for problem in problems:
    change_id = vcs.add_problem(problem, author="demo_user")
    print(f"Added problem {problem.id} (change: {change_id[:8]}...)")

# Create a version
version_id = vcs.create_version(
    version_number="v1.0.0",
    description="Initial dataset version",
    cutoff_date=datetime(2024, 7, 1)
)
print(f"\nCreated version v1.0.0 (ID: {version_id[:8]}...)")

# Get version history
history = vcs.get_version_history()
print("\nVersion History:")
print(history.to_string(index=False))

# Audit contamination risk
audit = vcs.audit_contamination_risk()
print("\nContamination Risk Audit:")
print(f"High similarity pairs: {audit['high_similarity_pairs']}")
print(f"Max similarity: {audit['max_similarity']:.3f}")
print("Recommendations:")
for rec in audit['risk_recommendations']:
    print(f"  - {rec}")
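
The two checksum levels used by the version control system (one hash per problem, one hash over the whole dataset) can be explored in isolation. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `naive_checksum`, `problem_checksum`, and `dataset_checksum` are hypothetical helper names, and the toy records are invented for illustration. It contrasts hashing a plain concatenation of fields with a length-prefixed variant that keeps field boundaries unambiguous, and it hashes `id:checksum` pairs in sorted order so the dataset checksum does not depend on insertion order.

```python
import hashlib
from typing import Dict


def naive_checksum(title: str, description: str, solution_code: str) -> str:
    """Hash a plain concatenation: field boundaries are lost before hashing."""
    data = f"{title}{description}{solution_code}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def problem_checksum(title: str, description: str, solution_code: str) -> str:
    """Hash each field with an 8-byte length prefix (assumed scheme)."""
    h = hashlib.sha256()
    for field in (title, description, solution_code):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix marks the boundary
        h.update(data)
    return h.hexdigest()


def dataset_checksum(checksums_by_id: Dict[str, str]) -> str:
    """Hash 'id:checksum' lines in sorted ID order for a stable result."""
    h = hashlib.sha256()
    for problem_id in sorted(checksums_by_id):
        line = f"{problem_id}:{checksums_by_id[problem_id]}\n"
        h.update(line.encode("utf-8"))
    return h.hexdigest()


# Toy records invented for this example
v1 = {
    "p1": problem_checksum("Two Sum", "Find two indices.", "def solve(): pass"),
    "p2": problem_checksum("Reverse List", "Reverse a list.", "def solve(): pass"),
}
# Same content, different insertion order
v1_reordered = {"p2": v1["p2"], "p1": v1["p1"]}
# One description edited
v2 = dict(v1, p1=problem_checksum("Two Sum", "Find two indices!", "def solve(): pass"))

naive_collides = naive_checksum("ab", "c", "") == naive_checksum("a", "bc", "")
prefixed_collides = problem_checksum("ab", "c", "") == problem_checksum("a", "bc", "")

print("naive concatenation collides:", naive_collides)
print("length-prefixed hash collides:", prefixed_collides)
print("order independent:", dataset_checksum(v1) == dataset_checksum(v1_reordered))
print("edit detected:", dataset_checksum(v1) != dataset_checksum(v2))
```

Whether the extra boundary handling matters depends on how problems are edited in practice; it is cheap insurance when titles and descriptions are revised independently.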

## 5. Advanced Visualization and Monitoring

### Real-time Contamination Monitoring Dashboard

In [None]:
class ContaminationMonitoringDashboard:
    """Advanced dashboard for monitoring contamination in real-time"""
    
    def __init__(self, vcs: DatasetVersionControl):
        self.vcs = vcs
        
    def create_comprehensive_dashboard(self, mock_performance: pd.DataFrame):
        """Create comprehensive contamination monitoring dashboard"""
        fig = plt.figure(figsize=(20, 16))
        
        # Create grid layout
        gs = fig.add_gridspec(4, 4, hspace=0.3, wspace=0.3)
        
        # 1. Similarity Matrix Heatmap (top-left)
        ax1 = fig.add_subplot(gs[0, :2])
        self._plot_similarity_heatmap(ax1)
        
        # 2. Temporal Performance Trends (top-right)
        ax2 = fig.add_subplot(gs[0, 2:])
        self._plot_temporal_trends(ax2, mock_performance)
        
        # 3. Score Distribution Analysis (middle-left)
        ax3 = fig.add_subplot(gs[1, :2])
        self._plot_score_distributions(ax3, mock_performance)
        
        # 4. Contamination Risk Gauge (middle-right)
        ax4 = fig.add_subplot(gs[1, 2:])
        self._plot_contamination_gauge(ax4)
        
        # 5. Problem Release Timeline (bottom-left)
        ax5 = fig.add_subplot(gs[2, :2])
        self._plot_release_timeline(ax5)
        
        # 6. Quality Metrics Over Time (bottom-right)
        ax6 = fig.add_subplot(gs[2, 2:])
        self._plot_quality_metrics(ax6)
        
        # 7. Alert Summary (bottom)
        ax7 = fig.add_subplot(gs[3, :])
        self._plot_alert_summary(ax7)
        
        plt.suptitle('Contamination Monitoring Dashboard', fontsize=20, y=0.98)
        plt.tight_layout()
        plt.show()
    
    def _plot_similarity_heatmap(self, ax):
        """Plot similarity heatmap of problems"""
        # Get similarity data from cache
        cursor = self.vcs.conn.execute(
            "SELECT problem1_id, problem2_id, similarity_score FROM similarity_cache"
        )
        similarity_data = cursor.fetchall()
        
        if not similarity_data:
            ax.text(0.5, 0.5, 'No similarity data available', 
                   ha='center', va='center', transform=ax.transAxes)
            ax.set_title('Problem Similarity Matrix')
            return
        
        # Create similarity matrix (sorted IDs keep the axis order stable across runs)
        problem_ids = sorted(set([d[0] for d in similarity_data] + [d[1] for d in similarity_data]))
        n = len(problem_ids)
        similarity_matrix = np.eye(n)
        
        id_to_idx = {pid: i for i, pid in enumerate(problem_ids)}
        
        for p1, p2, sim in similarity_data:
            i, j = id_to_idx[p1], id_to_idx[p2]
            similarity_matrix[i, j] = similarity_matrix[j, i] = sim
        
        # Plot heatmap
        im = ax.imshow(similarity_matrix, cmap='Reds', vmin=0, vmax=1)
        ax.set_title('Problem Similarity Matrix')
        ax.set_xlabel('Problems')
        ax.set_ylabel('Problems')
        
        # Add colorbar
        plt.colorbar(im, ax=ax, shrink=0.6)
        
        # Highlight high similarity
        high_sim_coords = np.where(similarity_matrix > 0.8)
        for i, j in zip(high_sim_coords[0], high_sim_coords[1]):
            if i != j:  # Don't highlight diagonal
                ax.add_patch(plt.Rectangle((j-0.5, i-0.5), 1, 1, 
                                         fill=False, edgecolor='yellow', linewidth=2))
    
    def _plot_temporal_trends(self, ax, performance_data):
        """Plot temporal performance trends"""
        for model in performance_data['model'].unique():
            model_data = performance_data[performance_data['model'] == model]
            model_data = model_data.sort_values('release_date')
            
            ax.plot(model_data['release_date'], model_data['score'], 
                   marker='o', label=model, alpha=0.7)
        
        # Add contamination warning zone (drawn before the legend so its label is listed)
        ax.axhline(y=performance_data['score'].mean() - 2*performance_data['score'].std(), 
                  color='red', linestyle='--', alpha=0.5, label='Warning Threshold')
        
        ax.set_title('Performance Trends Over Time')
        ax.set_xlabel('Release Date')
        ax.set_ylabel('Performance Score')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    def _plot_score_distributions(self, ax, performance_data):
        """Plot score distributions for contamination detection"""
        models = performance_data['model'].unique()
        
        for model in models:
            scores = performance_data[performance_data['model'] == model]['score']
            ax.hist(scores, bins=15, alpha=0.6, label=model, 
                   density=True, histtype='stepfilled')
        
        ax.set_title('Score Distribution Analysis')
        ax.set_xlabel('Performance Score')
        ax.set_ylabel('Density')
        ax.grid(True, alpha=0.3)
        
        # Add bimodality indicators
        for model in models:
            scores = performance_data[performance_data['model'] == model]['score']
            if len(scores) > 10:
                # Simple bimodality heuristic: count interior histogram bins
                # that are higher than both of their neighbours
                hist, bins = np.histogram(scores, bins=10)
                peaks = []
                for k in range(1, len(hist)-1):
                    if hist[k] > hist[k-1] and hist[k] > hist[k+1]:
                        peaks.append(bins[k])
                
                if len(peaks) >= 2:
                    ax.axvline(x=scores.mean(), color='red', linestyle=':', 
                             alpha=0.8, label=f'{model} Bimodal Warning')
        
        # Build the legend last so any bimodal warning lines are included
        ax.legend()
    
    def _plot_contamination_gauge(self, ax):
        """Plot contamination risk gauge"""
        # Get latest contamination score
        cursor = self.vcs.conn.execute(
            "SELECT contamination_score FROM dataset_versions ORDER BY creation_date DESC LIMIT 1"
        )
        result = cursor.fetchone()
        contamination_score = result[0] if result else 0
        
        # Create gauge: a half circle drawn in Cartesian coordinates, with
        # score 0 at the left end (angle pi) and score 1 at the right end (angle 0)
        theta = np.linspace(0, np.pi, 100)
        score_at_theta = 1 - theta / np.pi
        
        # Background arc
        ax.plot(np.cos(theta), np.sin(theta), 'k-', linewidth=10, alpha=0.3)
        
        # Risk zones
        low_zone = theta[score_at_theta <= 0.33]
        medium_zone = theta[(score_at_theta > 0.33) & (score_at_theta <= 0.66)]
        high_zone = theta[score_at_theta > 0.66]
        
        ax.plot(np.cos(low_zone), np.sin(low_zone), 'g-', linewidth=10, alpha=0.7, label='Low Risk')
        ax.plot(np.cos(medium_zone), np.sin(medium_zone), 'y-', linewidth=10, alpha=0.7, label='Medium Risk')
        ax.plot(np.cos(high_zone), np.sin(high_zone), 'r-', linewidth=10, alpha=0.7, label='High Risk')
        
        # Needle
        needle_angle = (1 - contamination_score) * np.pi
        ax.arrow(0, 0, np.cos(needle_angle) * 0.8, np.sin(needle_angle) * 0.8,
                head_width=0.1, head_length=0.1, fc='black', ec='black')
        
        ax.set_xlim(-1.2, 1.2)
        ax.set_ylim(-0.2, 1.2)
        ax.set_aspect('equal')
        ax.axis('off')
        ax.set_title(f'Contamination Risk: {contamination_score:.2f}')
        
        # Add score text
        ax.text(0, -0.1, f'{contamination_score:.3f}', ha='center', va='center', 
               fontsize=16, fontweight='bold')
    
    def _plot_release_timeline(self, ax):
        """Plot problem release timeline"""
        cursor = self.vcs.conn.execute(
            "SELECT release_date, difficulty FROM problems ORDER BY release_date"
        )
        problems = cursor.fetchall()
        
        if not problems:
            ax.text(0.5, 0.5, 'No problems in database', 
                   ha='center', va='center', transform=ax.transAxes)
            ax.set_title('Problem Release Timeline')
            return
        
        dates = [datetime.fromisoformat(p[0]) for p in problems]
        difficulties = [p[1] for p in problems]
        
        # Color mapping
        color_map = {'Easy': 'green', 'Medium': 'orange', 'Hard': 'red'}
        colors = [color_map.get(d, 'blue') for d in difficulties]
        
        # Scatter plot
        ax.scatter(dates, range(len(dates)), c=colors, alpha=0.7, s=50)
        
        ax.set_title('Problem Release Timeline')
        ax.set_xlabel('Release Date')
        ax.set_ylabel('Problem Index')
        
        # Add legend
        for difficulty, color in color_map.items():
            ax.scatter([], [], c=color, label=difficulty, s=50)
        ax.legend()
        
        # Rotate x-axis labels
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)
    
    def _plot_quality_metrics(self, ax):
        """Plot quality metrics over time"""
        cursor = self.vcs.conn.execute(
            "SELECT creation_date, total_problems, contamination_score FROM dataset_versions ORDER BY creation_date"
        )
        versions = cursor.fetchall()
        
        if not versions:
            ax.text(0.5, 0.5, 'No versions available', 
                   ha='center', va='center', transform=ax.transAxes)
            ax.set_title('Quality Metrics Over Time')
            return
        
        dates = [datetime.fromisoformat(v[0]) for v in versions]
        total_problems = [v[1] for v in versions]
        contamination_scores = [v[2] for v in versions]
        
        # Dual y-axis plot
        ax2 = ax.twinx()
        
        line1 = ax.plot(dates, total_problems, 'b-o', label='Total Problems')
        line2 = ax2.plot(dates, contamination_scores, 'r-s', label='Contamination Score')
        
        ax.set_xlabel('Version Date')
        ax.set_ylabel('Total Problems', color='b')
        ax2.set_ylabel('Contamination Score', color='r')
        ax.set_title('Quality Metrics Over Time')
        
        # Combine legends
        lines = line1 + line2
        labels = [l.get_label() for l in lines]
        ax.legend(lines, labels, loc='upper left')
    
    def _plot_alert_summary(self, ax):
        """Plot alert summary"""
        # Get contamination audit
        audit = self.vcs.audit_contamination_risk()
        
        # Create alert summary
        alerts = {
            'High Similarity Pairs': audit['high_similarity_pairs'],
            'Max Similarity': f"{audit['max_similarity']:.3f}",
            'Temporal Gaps': len(audit['temporal_gaps']),
            'Risk Level': 'HIGH' if audit['max_similarity'] > 0.9 else 'MEDIUM' if audit['max_similarity'] > 0.8 else 'LOW'
        }
        
        # Create table
        ax.axis('off')
        
        # Alert table
        table_data = [[k, str(v)] for k, v in alerts.items()]
        table = ax.table(cellText=table_data,
                        colLabels=['Alert Type', 'Value'],
                        cellLoc='center',
                        loc='center')
        table.auto_set_font_size(False)
        table.set_fontsize(12)
        table.scale(1, 2)
        
        # Color code by risk
        if alerts['Risk Level'] == 'HIGH':
            table[(4, 1)].set_facecolor('#ffcccc')
        elif alerts['Risk Level'] == 'MEDIUM':
            table[(4, 1)].set_facecolor('#ffffcc')
        else:
            table[(4, 1)].set_facecolor('#ccffcc')
        
        ax.set_title('Contamination Alert Summary', pad=20)
        
        # Add recommendations
        recommendations_text = "\n".join(audit['risk_recommendations'][:3])
        ax.text(0.5, 0.1, f"Recommendations:\n{recommendations_text}", 
               ha='center', va='bottom', transform=ax.transAxes,
               bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue', alpha=0.7))

# Create monitoring dashboard
dashboard = ContaminationMonitoringDashboard(vcs)
dashboard.create_comprehensive_dashboard(mock_performance)
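
The peak-counting heuristic behind the bimodal warning in the score distribution panel is easier to reason about on its own. The sketch below is illustrative only: `count_histogram_peaks` and its `value_range` parameter are hypothetical names introduced here, and the score lists are hand-built rather than real model results. It counts interior histogram bins that are strictly higher than both neighbours, which is the same rule the dashboard applies.

```python
from typing import Optional, Sequence, Tuple

import numpy as np


def count_histogram_peaks(
    scores: Sequence[float],
    bins: int = 10,
    value_range: Optional[Tuple[float, float]] = None,
) -> int:
    """Count interior bins strictly higher than both neighbouring bins."""
    hist, _ = np.histogram(scores, bins=bins, range=value_range)
    peaks = 0
    for i in range(1, len(hist) - 1):
        if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]:
            peaks += 1
    return peaks


# Hand-built scores on a 0-10 scale (not real model results)
# Two clusters, centred on 2 and on 8
two_cluster_scores = [1] * 2 + [2] * 6 + [3] * 2 + [7] * 2 + [8] * 6 + [9] * 2
# One cluster, centred on 5
one_cluster_scores = [3] * 2 + [4] * 5 + [5] * 8 + [6] * 5 + [7] * 2

two_cluster_peaks = count_histogram_peaks(two_cluster_scores, value_range=(0, 10))
one_cluster_peaks = count_histogram_peaks(one_cluster_scores, value_range=(0, 10))

print("two-cluster sample peaks:", two_cluster_peaks)
print("one-cluster sample peaks:", one_cluster_peaks)
print("bimodal warning:", two_cluster_peaks >= 2)
```

This rule is sensitive to bin count and sampling noise, so a second peak on a small sample is a prompt for manual review rather than evidence of contamination; a formal unimodality test such as Hartigan's dip test would be a more robust choice.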

## 6. Production Implementation Guide

### Complete Framework for Real-World Deployment

In [None]:
class ProductionContaminationFramework:
    """Production-ready contamination detection and prevention framework"""
    
    def __init__(self, config: Dict):
        self.config = config
        self.vcs = DatasetVersionControl(config.get('db_path', ':memory:'))
        self.similarity_detector = ProblemSimilarityDetector()
        self.stat_detector = StatisticalContaminationDetector()
        self.monitoring_enabled = config.get('monitoring_enabled', True)
        
    def create_deployment_checklist(self) -> Dict:
        """Create comprehensive deployment checklist"""
        checklist = {
            'data_quality': {
                'duplicate_detection': 'Configure similarity thresholds',
                'temporal_validation': 'Set up temporal split validation',
                'quality_metrics': 'Define quality metric baselines',
                'automated_testing': 'Implement automated quality tests'
            },
            'contamination_prevention': {
                'similarity_monitoring': 'Set up real-time similarity monitoring',
                'temporal_analysis': 'Configure temporal trend analysis',
                'statistical_tests': 'Implement statistical contamination tests',
                'alert_system': 'Configure contamination alert system'
            },
            'version_control': {
                'change_tracking': 'Enable comprehensive change tracking',
                'version_tagging': 'Set up semantic versioning',
                'rollback_capability': 'Implement version rollback',
                'audit_logging': 'Enable full audit logging'
            },
            'monitoring': {
                'dashboard_setup': 'Deploy monitoring dashboard',
                'metric_collection': 'Configure metric collection',
                'alert_rules': 'Set up alerting rules',
                'reporting': 'Configure automated reporting'
            },
            'compliance': {
                'data_provenance': 'Implement data provenance tracking',
                'privacy_controls': 'Set up privacy controls',
                'audit_trails': 'Ensure complete audit trails',
                'documentation': 'Maintain comprehensive documentation'
            }
        }
        
        return checklist
    
    def generate_implementation_code(self) -> Dict[str, str]:
        """Generate production-ready implementation code"""
        
        code_templates = {
            'api_endpoint': '''
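from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Contamination Detection API")

class ProblemData(BaseModel):
    """Request body; fields mirror the problem columns used for similarity checks"""
    id: str
    title: str
    description: str
    solution_code: str

@app.post("/detect-contamination")
async def detect_contamination(problem_data: ProblemData):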
    """Detect potential contamination in new problem"""
    try:
        # Initialize detector
        detector = ProblemSimilarityDetector()
        
        # Check similarity with existing problems
        similar_problems = detector.find_similar_problems(problem_data)
        
        # Calculate contamination risk
        risk_score = detector.calculate_contamination_risk(similar_problems)
        
        return {
            "contamination_risk": risk_score,
            "similar_problems": similar_problems,
            "recommendation": "REJECT" if risk_score > 0.8 else "ACCEPT"
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
''',
            
            'monitoring_job': '''
import schedule
import time
from datetime import datetime

def run_contamination_monitoring():
    """Scheduled job for contamination monitoring"""
    detector = StatisticalContaminationDetector()
    
    # Get recent performance data
    performance_data = get_recent_performance_data()
    
    # Run contamination analysis
    analysis = detector.temporal_performance_analysis(performance_data)
    
    # Check for alerts
    for model, stats in analysis.items():
        if stats['contamination_risk']['risk_level'] == 'HIGH':
            send_alert(f"HIGH contamination risk detected for {model}")
    
    # Log results
    log_monitoring_results(analysis)

# Schedule monitoring
schedule.every(1).hours.do(run_contamination_monitoring)

while True:
    schedule.run_pending()
    time.sleep(60)
''',
            
            'data_pipeline': '''
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def validate_new_problems(**context):
    """Validate new problems for contamination"""
    vcs = DatasetVersionControl()
    new_problems = context['task_instance'].xcom_pull(task_ids='extract_problems')
    
    validated_problems = []
    for problem in new_problems:
        # Check similarity (DatasetVersionControl exposes this as a private helper)
        similar = vcs._find_similar_problems(problem, threshold=0.8)
        
        if not similar:
            validated_problems.append(problem)
            vcs.add_problem(problem)
        else:
            log_rejected_problem(problem, similar)
    
    return validated_problems

dag = DAG(
    'contamination_prevention_pipeline',
    default_args={
        'owner': 'data-team',
        'depends_on_past': False,
        'start_date': datetime(2024, 1, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    },
    schedule_interval='@daily'
)

validate_task = PythonOperator(
    task_id='validate_contamination',
    python_callable=validate_new_problems,
    dag=dag
)
'''
        }
        
        return code_templates
    
    def create_testing_suite(self) -> Dict[str, str]:
        """Create comprehensive testing suite"""
        
        test_suite = {
            'unit_tests': '''
import unittest
from contamination_detector import ProblemSimilarityDetector

class TestContaminationDetection(unittest.TestCase):
    
    def setUp(self):
        self.detector = ProblemSimilarityDetector()
    
    def test_similarity_calculation(self):
        """Test similarity calculation between problems"""
        problem1 = create_mock_problem("problem1")
        problem2 = create_mock_problem("problem2")
        
        similarity = self.detector.calculate_similarity(problem1, problem2)
        
        self.assertIsInstance(similarity, float)
        self.assertGreaterEqual(similarity, 0)
        self.assertLessEqual(similarity, 1)
    
    def test_contamination_detection(self):
        """Test contamination detection logic"""
        # Test with high similarity (should detect contamination)
        high_sim_problems = [create_similar_problems()]
        risk = self.detector.calculate_contamination_risk(high_sim_problems)
        self.assertGreater(risk, 0.8)
        
        # Test with low similarity (should not detect contamination)
        low_sim_problems = [create_different_problems()]
        risk = self.detector.calculate_contamination_risk(low_sim_problems)
        self.assertLess(risk, 0.3)

if __name__ == '__main__':
    unittest.main()
''',
            
            'integration_tests': '''
import pytest
from contamination_framework import ProductionContaminationFramework

@pytest.fixture
def framework():
    config = {
        'db_path': ':memory:',
        'monitoring_enabled': True,
        'similarity_threshold': 0.8
    }
    return ProductionContaminationFramework(config)

def test_end_to_end_workflow(framework):
    """Test complete contamination detection workflow"""
    # Add initial problems
    problems = create_test_problems(10)
    for problem in problems:
        framework.vcs.add_problem(problem)
    
    # Create version
    version_id = framework.vcs.create_version("v1.0.0", "Test version")
    assert version_id is not None
    
    # Test contamination detection
    new_problem = create_similar_problem(problems[0])
    result = framework.detect_contamination(new_problem)
    
    assert result['contamination_risk'] > 0.8
    assert result['recommendation'] == 'REJECT'

def test_performance_monitoring(framework):
    """Test performance monitoring functionality"""
    # Create mock performance data
    performance_data = create_mock_performance_data()
    
    # Run analysis
    analysis = framework.stat_detector.temporal_performance_analysis(performance_data)
    
    assert 'CleanModel' in analysis
    assert 'ContaminatedModel' in analysis
    assert analysis['ContaminatedModel']['contamination_risk']['risk_level'] == 'HIGH'
''',
            
            'load_tests': '''
import time
from concurrent.futures import ThreadPoolExecutor

from contamination_detector import ProblemSimilarityDetector

def load_test_similarity_detection():
    """Load test similarity detection performance"""
    detector = ProblemSimilarityDetector()
    problems = create_test_problems(100)
    
    start_time = time.time()
    
    # Test with multiple threads
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for i in range(0, len(problems), 10):
            batch = problems[i:i+10]
            future = executor.submit(detector.calculate_similarity_matrix, batch)
            futures.append(future)
        
        # Wait for all to complete
        for future in futures:
            future.result()
    
    end_time = time.time()
    
    print(f"Processed {len(problems)} problems in {end_time - start_time:.2f} seconds")
    print(f"Throughput: {len(problems) / (end_time - start_time):.2f} problems/second")

if __name__ == '__main__':
    load_test_similarity_detection()
'''
        }
        
        return test_suite
    
    def generate_documentation(self) -> str:
        """Generate comprehensive documentation"""
        
        documentation = '''
# Advanced Contamination Detection Framework

## Overview

This framework provides comprehensive contamination detection and prevention for ML datasets, particularly focusing on temporal datasets used for LLM evaluation.

## Key Features

### 1. Multi-Dimensional Similarity Detection
- Text similarity using TF-IDF and cosine similarity
- Algorithmic pattern matching
- Constraint signature analysis
- Input/output pattern recognition
- Code pattern detection

### 2. Statistical Contamination Detection
- Performance distribution analysis
- Temporal trend analysis
- Change point detection
- Bimodality testing
- Outlier detection

### 3. Dataset Version Control
- Complete change tracking
- Checksums and integrity verification
- Rollback capabilities
- Quality metrics tracking
- Contamination scoring

### 4. Real-time Monitoring
- Live contamination monitoring
- Alert system
- Dashboard visualization
- Automated reporting

## Usage Examples

### Basic Setup
```python
from contamination_framework import ProductionContaminationFramework

# Initialize framework
config = {
    'db_path': 'contamination.db',
    'similarity_threshold': 0.8,
    'monitoring_enabled': True
}
framework = ProductionContaminationFramework(config)
```

### Add Problems with Contamination Check
```python
# Add new problem
problem = Problem(...)
result = framework.add_problem_with_validation(problem)

if result['accepted']:
    print("Problem accepted")
else:
    print(f"Problem rejected: {result['reason']}")
```

### Monitor Performance
```python
# Get performance data
performance_data = get_model_performance()

# Analyze for contamination
analysis = framework.analyze_contamination(performance_data)

for model, risk in analysis.items():
    if risk['level'] == 'HIGH':
        send_alert(f"Contamination detected in {model}")
```

## Configuration

### Similarity Thresholds
- `similarity_threshold`: 0.8 (default) - Problems above this similarity are flagged
- `high_risk_threshold`: 0.95 - Problems above this are automatically rejected

### Statistical Thresholds
- `contamination_risk_threshold`: 0.7 - Performance patterns above this trigger alerts
- `temporal_correlation_threshold`: -0.5 - Negative correlations below this are suspicious

### Monitoring Settings
- `monitoring_interval`: 3600 (1 hour) - How often to run contamination checks
- `alert_cooldown`: 1800 (30 minutes) - Minimum time between similar alerts

## Best Practices

### 1. Data Ingestion
- Always run similarity checks before adding new problems
- Maintain proper temporal ordering
- Validate all metadata
- Use checksums for integrity

### 2. Monitoring
- Set up automated monitoring jobs
- Configure appropriate alert thresholds
- Regularly review contamination reports
- Maintain audit logs

### 3. Version Management
- Create versions at regular intervals
- Tag versions with semantic versioning
- Maintain rollback capabilities
- Document all changes

## Troubleshooting

### High False Positive Rate
- Adjust similarity thresholds
- Review algorithm keyword detection
- Check constraint extraction logic

### Performance Issues
- Enable similarity caching
- Use batch processing for large datasets
- Consider distributed processing for very large datasets

### Memory Usage
- Use streaming for large similarity calculations
- Implement data pagination
- Clear caches periodically

## API Reference

[Detailed API documentation would follow...]
        '''
        
        return documentation
    
    def create_deployment_plan(self) -> Dict:
        """Create detailed deployment plan"""
        
        plan = {
            'phase_1_preparation': {
                'duration': '1-2 weeks',
                'tasks': [
                    'Set up development environment',
                    'Install dependencies',
                    'Configure database',
                    'Set up monitoring infrastructure',
                    'Create test data'
                ],
                'deliverables': [
                    'Development environment ready',
                    'Database schema deployed',
                    'Monitoring stack configured'
                ]
            },
            'phase_2_implementation': {
                'duration': '2-3 weeks',
                'tasks': [
                    'Implement core contamination detection',
                    'Build similarity detection algorithms',
                    'Create statistical analysis modules',
                    'Develop version control system',
                    'Build monitoring dashboard'
                ],
                'deliverables': [
                    'Core framework implemented',
                    'All detection algorithms working',
                    'Dashboard deployed'
                ]
            },
            'phase_3_testing': {
                'duration': '1-2 weeks',
                'tasks': [
                    'Unit testing',
                    'Integration testing',
                    'Load testing',
                    'Security testing',
                    'User acceptance testing'
                ],
                'deliverables': [
                    'All tests passing',
                    'Performance validated',
                    'Security cleared'
                ]
            },
            'phase_4_deployment': {
                'duration': '1 week',
                'tasks': [
                    'Production deployment',
                    'Data migration',
                    'Monitor configuration',
                    'Alert setup',
                    'Documentation delivery'
                ],
                'deliverables': [
                    'System live in production',
                    'Monitoring active',
                    'Team trained'
                ]
            },
            'phase_5_optimization': {
                'duration': 'Ongoing',
                'tasks': [
                    'Performance optimization',
                    'Threshold tuning',
                    'Feature enhancements',
                    'Regular maintenance'
                ],
                'deliverables': [
                    'Optimized performance',
                    'Regular reports',
                    'Continuous improvements'
                ]
            }
        }
        
        return plan

# Create production framework demonstration
production_config = {
    'db_path': ':memory:',
    'similarity_threshold': 0.8,
    'monitoring_enabled': True,
    'high_risk_threshold': 0.95,
    'contamination_risk_threshold': 0.7
}

framework = ProductionContaminationFramework(production_config)

print("Production Framework Implementation Guide")
print("=========================================\n")

# Generate deployment checklist
checklist = framework.create_deployment_checklist()
print("Deployment Checklist:")
for category, items in checklist.items():
    print(f"\n{category.upper()}:")
    for item, description in items.items():
        print(f"  ☐ {item}: {description}")

# Generate implementation code
code_templates = framework.generate_implementation_code()
print("\n\nCode Templates Generated:")
for template_name in code_templates.keys():
    print(f"  - {template_name}.py")

# Generate testing suite
test_suite = framework.create_testing_suite()
print("\nTesting Suite Generated:")
for test_type in test_suite.keys():
    print(f"  - {test_type}.py")

# Generate deployment plan
deployment_plan = framework.create_deployment_plan()
print("\nDeployment Plan:")
for phase, details in deployment_plan.items():
    print(f"\n{phase.upper()} ({details['duration']}):")
    for task in details['tasks'][:3]:  # Show first 3 tasks
        print(f"  • {task}")
    if len(details['tasks']) > 3:
        print(f"  ... and {len(details['tasks']) - 3} more tasks")

print("\n\n🎯 Framework ready for production deployment!")
print("Complete documentation and code templates generated.")
print("Follow the deployment plan for systematic implementation.")

## 7. Key Takeaways and Best Practices

### Critical Insights Beyond the Paper:

1. **Multi-Dimensional Contamination**: Temporal splits alone are insufficient; text, algorithmic, and code-pattern similarity must also be checked
2. **Statistical Detection Methods**: Performance distribution analysis can reveal memorization patterns that a release-date check misses
3. **Proactive Prevention**: Running similarity checks at ingestion catches near-duplicate problems before they enter the benchmark
4. **Version Control is Essential**: Complete audit trails and rollback capabilities are production requirements

### Advanced Contamination Patterns:

**Subtle Contamination Types:**
- **Paraphrased Problems**: Same logic, different wording
- **Cross-Platform Leakage**: LeetCode → CodeForces → AtCoder
- **Synthetic Contamination**: Generated problems based on test sets
- **Human Contamination**: Evaluator bias from known solutions
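
To make the paraphrase case concrete, here is a minimal sketch that scores lexical overlap between two problem statements using Jaccard similarity over word bigrams. This is not the notebook's `ProblemSimilarityDetector`; the function names `shingles` and `jaccard` and the sample statements are assumptions chosen for illustration.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles from lowercased alphanumeric tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 2) -> float:
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical statements: an original, a light rewording, an unrelated problem
original = "Given an array of integers, return the indices of two numbers that add up to target."
reworded = "Given an array of integers, find indices of two numbers that sum to the target."
unrelated = "Compute the shortest path between two nodes in a weighted graph."

print(f"original vs reworded:  {jaccard(original, reworded):.2f}")
print(f"original vs unrelated: {jaccard(original, unrelated):.2f}")
```

The reworded pair scores higher than the unrelated pair but still lands well below the 0.8 warning threshold. That is the core difficulty with paraphrased problems: text overlap alone under-reports them, so algorithmic and constraint-level signals are needed as well.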

**Detection Signatures:**
- **Bimodal Performance**: Clear separation between "seen" and "unseen" problems
- **Temporal Degradation**: Performance drops on problems released after the model's training cutoff
- **Outlier Clustering**: Unusually high scores on specific problem types
- **Pattern Consistency**: Identical solution patterns across similar problems
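
The temporal degradation signature can be checked with a plain correlation between release time and score. The sketch below assumes monthly pass rates indexed by months relative to a model's training cutoff. The function names, the sample numbers, and the use of Pearson correlation are illustrative assumptions, not the method used by `StatisticalContaminationDetector`.

```python
from math import sqrt
from typing import Dict, Sequence, Union


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def temporal_degradation(
    months_since_cutoff: Sequence[float],
    pass_rates: Sequence[float],
    threshold: float = -0.5,
) -> Dict[str, Union[float, bool]]:
    """Flag a model whose scores fall as problems get newer."""
    r = pearson(months_since_cutoff, pass_rates)
    return {"correlation": r, "suspicious": r < threshold}


# Hypothetical data: high scores before the cutoff (month 0), sharp drop after
months = [-3, -2, -1, 0, 1, 2, 3]
rates = [0.90, 0.88, 0.91, 0.85, 0.50, 0.45, 0.40]
print(temporal_degradation(months, rates))
```

A strongly negative correlation is only a signal, not proof: newer problems may simply be harder, so a flagged model is worth comparing against a baseline model evaluated on the same months.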

### Production Implementation Guidelines:

1. **Similarity Thresholds**:
   - Warning: 0.8+ similarity
   - Rejection: 0.95+ similarity
   - Manual review: 0.85-0.94 range

2. **Statistical Monitoring**:
   - Flag temporal correlations below -0.5
   - Monitor bimodality in score distributions
   - Alert on performance degradation of more than 2 standard deviations

3. **Version Control Requirements**:
   - Immutable versions with checksums
   - Complete change audit trails
   - Rollback capabilities
   - Quality metric tracking

4. **Real-time Monitoring**:
   - Hourly contamination checks
   - Automated alert system
   - Dashboard with risk gauges
   - Regular audit reports
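
The similarity thresholds and the checksum requirement above can be sketched in a few lines. The guidelines leave the band boundaries slightly open (a warning starts at 0.8, manual review covers 0.85-0.94), so this sketch assumes inclusive lower bounds and treats everything from 0.95 upward as a rejection. The names `triage` and `version_checksum` and the problem dictionary layout are hypothetical and are not part of `DatasetVersionControl`.

```python
import hashlib
import json
from typing import Dict, List

# Thresholds from the guidelines; boundary handling is an assumption
WARNING_AT = 0.80
REVIEW_AT = 0.85
REJECT_AT = 0.95


def triage(similarity: float) -> str:
    """Map a similarity score in [0, 1] to an ingestion decision."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    if similarity >= REJECT_AT:
        return "REJECT"
    if similarity >= REVIEW_AT:
        return "MANUAL_REVIEW"
    if similarity >= WARNING_AT:
        return "WARNING"
    return "ACCEPT"


def version_checksum(problems: List[Dict]) -> str:
    """SHA-256 over a canonical JSON dump, independent of problem order."""
    ordered = sorted(problems, key=lambda p: str(p["id"]))
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical usage
for score in (0.50, 0.82, 0.90, 0.97):
    print(score, triage(score))

dataset = [{"id": "p1", "text": "two sum"}, {"id": "p2", "text": "graph path"}]
print(version_checksum(dataset))
```

Sorting by `id` and dumping with `sort_keys=True` keeps the checksum stable when the same problems are loaded in a different order, while any edit to a problem's content changes the digest.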

### Research Extensions:

1. **Cross-Modal Contamination**: Detection across code/text/image modalities
2. **Adversarial Contamination**: Sophisticated contamination designed to evade detection
3. **Federated Learning Contamination**: Detection in distributed training scenarios
4. **Dynamic Threshold Adjustment**: Data-driven tuning of similarity and risk thresholds

### Key Implementation Priorities:

1. **Start with Similarity Detection**: Highest ROI for contamination prevention
2. **Add Statistical Monitoring**: Critical for detecting memorization patterns
3. **Implement Version Control**: Essential for audit and compliance
4. **Build Monitoring Dashboard**: Enables proactive contamination management
5. **Create Alert System**: Ensures rapid response to contamination events

This framework extends the paper's basic temporal analysis with similarity detection, statistical monitoring, dataset version control, and real-time alerting, giving evaluation teams a practical starting point for contamination prevention and detection in LLM benchmarks.