# Focused Learning: Temporal Dataset Design & Contamination-Free Evaluation

## Learning Objectives
1. Understand the critical importance of temporal splits in ML evaluation
2. Learn how to design contamination-free benchmarks for LLMs
3. Master techniques for detecting and preventing data leakage
4. Implement temporal validation strategies for code generation tasks

## Concept Source
- **Paper Section**: Section 2.2 (Dataset Overview) and Section 3 (Holistic Evaluation)
- **Key Figures**: Figure 3 (Monthly pass rates)
- **Critical Quote**: "To ensure temporal validity, we adopted a strict time-based split: problems released after July 1, 2024, form the test set for benchmarking, while those released earlier constitute the training set." (Page 3)

## 1. The Data Contamination Problem in LLMs

### Why is this critical?
Large Language Models are trained on massive internet datasets. When evaluating these models, we face a fundamental challenge: **How do we know if the model has already seen the test data during training?**

This is especially problematic for:
- Code generation (solutions posted online)
- Academic benchmarks (widely discussed)
- Competition problems (public repositories)

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## 2. Understanding Temporal Splits

### Mathematical Foundation

Let's define the temporal split formally:

Given a dataset $D = \{(x_i, y_i, t_i)\}_{i=1}^n$ where:
- $x_i$ is the input (problem description)
- $y_i$ is the output (solution)
- $t_i$ is the timestamp (release date)

We define a cutoff time $t_c$ (July 1, 2024) such that:
- Training set: $D_{train} = \{(x_i, y_i, t_i) : t_i < t_c\}$
- Test set: $D_{test} = \{(x_i, y_i, t_i) : t_i \geq t_c\}$

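This ensures that $D_{train} \cap D_{test} = \emptyset$ and, more importantly, that the two sets are separated **temporally**: every training timestamp precedes $t_c$ and every test timestamp is at or after it. A model whose training data ends before $t_c$ therefore cannot have seen any test problem.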
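
As a quick sanity check on the definition above, the following minimal sketch verifies that a split respects the cutoff. The helper name `check_temporal_split` and the sample dates are hypothetical and chosen only for illustration.

```python
import pandas as pd

def check_temporal_split(train_dates, test_dates, cutoff):
    """Return True if every training date is < cutoff and every test date is >= cutoff."""
    cutoff = pd.to_datetime(cutoff)
    train_dates = pd.to_datetime(pd.Series(train_dates))
    test_dates = pd.to_datetime(pd.Series(test_dates))
    return bool((train_dates < cutoff).all() and (test_dates >= cutoff).all())

# Hypothetical release dates, for illustration only
train = ["2024-05-12", "2024-06-30"]
test = ["2024-07-01", "2024-09-15"]

valid_split = check_temporal_split(train, test, "2024-07-01")
# A problem dated after the cutoff that slipped into the training set
leaky_split = check_temporal_split(train + ["2024-07-02"], test, "2024-07-01")
print(valid_split, leaky_split)
```

Running a check like this right after building a split is a cheap way to catch a misparsed date or an off-by-one on the cutoff boundary.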

In [None]:
class TemporalDataset:
    """Implementation of temporal dataset splitting for contamination-free evaluation"""
    
    def __init__(self, cutoff_date: str = "2024-07-01"):
        self.cutoff_date = pd.to_datetime(cutoff_date)
        self.problems = []
        
    def add_problem(self, problem_id: str, release_date: str, difficulty: str, content: str):
        """Add a problem with temporal metadata"""
        self.problems.append({
            'problem_id': problem_id,
            'release_date': pd.to_datetime(release_date),
            'difficulty': difficulty,
            'content': content
        })
    
    def create_temporal_split(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Create train/test split based on temporal cutoff"""
        df = pd.DataFrame(self.problems)
        
        # Temporal split
        train_mask = df['release_date'] < self.cutoff_date
        train_df = df[train_mask].copy()
        test_df = df[~train_mask].copy()
        
        # Add split labels
        train_df['split'] = 'train'
        test_df['split'] = 'test'
        
        return train_df, test_df
    
    def analyze_temporal_distribution(self, train_df: pd.DataFrame, test_df: pd.DataFrame):
        """Analyze the temporal distribution of splits"""
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
        
        # Timeline visualization
        all_dates = pd.concat([train_df['release_date'], test_df['release_date']])
        
        # Plot 1: Problem release timeline
        ax1.scatter(train_df['release_date'], [1]*len(train_df), 
                   alpha=0.6, label='Training', color='blue', s=50)
        ax1.scatter(test_df['release_date'], [1]*len(test_df), 
                   alpha=0.6, label='Test', color='red', s=50)
        ax1.axvline(x=self.cutoff_date, color='green', linestyle='--', 
                   linewidth=2, label=f'Cutoff: {self.cutoff_date.date()}')
        ax1.set_ylim(0.5, 1.5)
        ax1.set_ylabel('Dataset')
        ax1.set_title('Temporal Distribution of Problems')
        ax1.legend()
        ax1.set_yticks([])
        
        # Plot 2: Monthly problem counts
        train_monthly = train_df.groupby(train_df['release_date'].dt.to_period('M')).size()
        test_monthly = test_df.groupby(test_df['release_date'].dt.to_period('M')).size()
        
        months = pd.period_range(start=all_dates.min(), end=all_dates.max(), freq='M')
        train_counts = [train_monthly.get(m, 0) for m in months]
        test_counts = [test_monthly.get(m, 0) for m in months]
        
        x = range(len(months))
        width = 0.35
        
        ax2.bar([i - width/2 for i in x], train_counts, width, label='Training', color='blue', alpha=0.7)
        ax2.bar([i + width/2 for i in x], test_counts, width, label='Test', color='red', alpha=0.7)
        
        ax2.set_xlabel('Month')
        ax2.set_ylabel('Number of Problems')
        ax2.set_title('Monthly Problem Distribution')
        ax2.set_xticks(x[::3])  # Show every 3rd month
        ax2.set_xticklabels([str(m) for m in months[::3]], rotation=45)
        ax2.legend()
        
        plt.tight_layout()
        plt.show()
        
        # Statistics
        print(f"Training set: {len(train_df)} problems ({len(train_df)/(len(train_df)+len(test_df))*100:.1f}%)")
        print(f"Test set: {len(test_df)} problems ({len(test_df)/(len(train_df)+len(test_df))*100:.1f}%)")
        print(f"\nDate ranges:")
        print(f"Training: {train_df['release_date'].min().date()} to {train_df['release_date'].max().date()}")
        print(f"Test: {test_df['release_date'].min().date()} to {test_df['release_date'].max().date()}")

In [None]:
# Create mock dataset with realistic temporal distribution
np.random.seed(42)
dataset = TemporalDataset(cutoff_date="2024-07-01")

# Generate problems with realistic release pattern
# LeetCode releases ~350 problems per year (from paper)
start_date = datetime(2020, 1, 1)
end_date = datetime(2025, 3, 1)

current_date = start_date
problem_id = 1000

while current_date < end_date:
    # Weekly contests (4 problems) + daily problems
    problems_this_month = np.random.poisson(29)  # ~350/year
    
    for _ in range(problems_this_month):
        # Random day within the month
        day_offset = np.random.randint(0, 28)
        release_date = current_date + timedelta(days=day_offset)
        
        # Difficulty distribution from paper: Easy 23.91%, Medium 52.21%, Hard 23.88%
        difficulty = np.random.choice(['Easy', 'Medium', 'Hard'], 
                                    p=[0.2391, 0.5221, 0.2388])
        
        dataset.add_problem(
            problem_id=f"LC{problem_id}",
            release_date=release_date.strftime("%Y-%m-%d"),
            difficulty=difficulty,
            content=f"Problem {problem_id} content"
        )
        problem_id += 1
    
    # Move to next month
    if current_date.month == 12:
        current_date = current_date.replace(year=current_date.year + 1, month=1)
    else:
        current_date = current_date.replace(month=current_date.month + 1)

# Create temporal split
train_df, test_df = dataset.create_temporal_split()

# Analyze distribution
dataset.analyze_temporal_distribution(train_df, test_df)

## 3. Contamination Detection Techniques

### How do we detect if a model has seen test data?

The paper uses **temporal performance analysis** to detect contamination. The key insight:
- If a model has memorized solutions that were public before its training cutoff, its pass rate should **decrease** on problems released after that cutoff
- A roughly flat pass rate across release months suggests genuine capability
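
To make the two cases concrete, here is a minimal sketch that fits a straight line to monthly pass rates and reads off the slope. The pass-rate values are made up for illustration, and `monthly_trend_slope` is a hypothetical helper name.

```python
import numpy as np

def monthly_trend_slope(pass_rates):
    """Least-squares slope of the pass rate, in percentage points per month."""
    x = np.arange(len(pass_rates))
    slope, _intercept = np.polyfit(x, pass_rates, 1)
    return float(slope)

# Made-up monthly pass rates (%), for illustration only
flat_rates = [50.0, 50.0, 50.0, 50.0, 50.0, 50.0]
declining_rates = [60.0, 55.0, 50.0, 45.0, 40.0, 35.0]

flat_slope = monthly_trend_slope(flat_rates)
declining_slope = monthly_trend_slope(declining_rates)
print(f"flat: {flat_slope:.2f}, declining: {declining_slope:.2f}")
```

The flat series has a slope of essentially zero, while the declining series loses five percentage points per month. With real data the monthly rates are noisy, so a slope should be read alongside the number of problems per month.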

In [None]:
class ContaminationDetector:
    """Detect potential data contamination in model evaluation"""
    
    def __init__(self, model_release_date: str):
        self.model_release_date = pd.to_datetime(model_release_date)
        
    def simulate_model_performance(self, test_df: pd.DataFrame, 
                                 contaminated: bool = False) -> pd.DataFrame:
        """Simulate model performance with/without contamination"""
        results = []
        
        for _, problem in test_df.iterrows():
            base_difficulty_score = {
                'Easy': 0.8,
                'Medium': 0.5,
                'Hard': 0.2
            }[problem['difficulty']]
            
            if contaminated:
                # Model performance degrades for problems after model training
                if problem['release_date'] > self.model_release_date:
                    # Newer problems: lower performance
                    performance_multiplier = 0.7
                else:
                    # Older problems: might have seen them
                    performance_multiplier = 1.2
            else:
                # Genuine model: consistent performance with small random variation
                performance_multiplier = 1.0 + np.random.normal(0, 0.1)
            
            # Calculate pass probability, clamped to [0, 1]
            pass_prob = min(1.0, max(0.0, base_difficulty_score * performance_multiplier))
            passed = np.random.random() < pass_prob
            
            results.append({
                'problem_id': problem['problem_id'],
                'release_date': problem['release_date'],
                'difficulty': problem['difficulty'],
                'passed': passed,
                'month': problem['release_date'].to_period('M')
            })
            
        return pd.DataFrame(results)
    
    def analyze_temporal_performance(self, results_df: pd.DataFrame, model_name: str):
        """Analyze performance over time to detect contamination"""
        # Calculate monthly pass rates
        monthly_stats = results_df.groupby('month').agg({
            'passed': ['sum', 'count']
        })
        monthly_stats.columns = ['passed', 'total']
        monthly_stats['pass_rate'] = monthly_stats['passed'] / monthly_stats['total'] * 100
        
        # Fit trend line
        x = np.arange(len(monthly_stats))
        y = monthly_stats['pass_rate'].values
        z = np.polyfit(x, y, 1)
        p = np.poly1d(z)
        
        # Plot
        plt.figure(figsize=(12, 6))
        plt.plot(monthly_stats.index.astype(str), y, 'o-', label=model_name, markersize=8)
        plt.plot(monthly_stats.index.astype(str), p(x), '--', 
                label=f'Trend (slope: {z[0]:.2f})', alpha=0.8)
        
        # Mark model release date
        model_month = self.model_release_date.to_period('M')
        if model_month in monthly_stats.index:
            idx = monthly_stats.index.get_loc(model_month)
            plt.axvline(x=idx, color='red', linestyle=':', 
                       label=f'Model Release: {self.model_release_date.date()}')
        
        plt.xlabel('Month')
        plt.ylabel('Pass Rate (%)')
        plt.title(f'Temporal Performance Analysis: {model_name}')
        plt.xticks(rotation=45)
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
        
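        # Contamination indicators. The slope is in percentage points per month;
        # the -1 and -2 thresholds below are illustrative heuristics, not calibrated values.
        slope = z[0]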
        variance = np.var(y)
        
        print(f"\nContamination Analysis for {model_name}:")
        print(f"Temporal slope: {slope:.3f}")
        print(f"Performance variance: {variance:.2f}")
        
        if slope < -2:
            print("⚠️  WARNING: Strong negative trend suggests possible contamination")
        elif slope < -1:
            print("⚠️  CAUTION: Negative trend may indicate contamination")
        else:
            print("✓ No strong evidence of contamination")
            
        return slope, variance

In [None]:
# Demonstrate contamination detection
detector_clean = ContaminationDetector(model_release_date="2024-08-01")
detector_contaminated = ContaminationDetector(model_release_date="2024-08-01")

# Simulate performance for both scenarios
clean_results = detector_clean.simulate_model_performance(test_df, contaminated=False)
contaminated_results = detector_contaminated.simulate_model_performance(test_df, contaminated=True)

# Analyze both
print("=== Clean Model (No Contamination) ===")
clean_slope, clean_var = detector_clean.analyze_temporal_performance(clean_results, "Simulated Model (Clean)")

print("\n=== Contaminated Model ===")
cont_slope, cont_var = detector_contaminated.analyze_temporal_performance(contaminated_results, "Simulated Model (Contaminated)")
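
It helps to know what gap the simulation is built to produce before reading the plots. The sketch below computes the difficulty-weighted pass probability for each multiplier used in `simulate_model_performance`, reusing the base scores and difficulty mix from the cells above. The helper name `expected_pass_rate` is hypothetical, and the clean case ignores the random noise term.

```python
# Base pass probabilities and difficulty mix taken from the cells above
base = {"Easy": 0.8, "Medium": 0.5, "Hard": 0.2}
mix = {"Easy": 0.2391, "Medium": 0.5221, "Hard": 0.2388}

def expected_pass_rate(multiplier):
    """Difficulty-weighted pass probability for a fixed multiplier, capped at 1.0."""
    return sum(mix[d] * min(1.0, base[d] * multiplier) for d in base)

seen = expected_pass_rate(1.2)    # problems released on or before the model date
unseen = expected_pass_rate(0.7)  # problems released after the model date
clean = expected_pass_rate(1.0)   # clean scenario, ignoring the random noise term
print(f"seen: {seen:.3f}, unseen: {unseen:.3f}, clean: {clean:.3f}")
```

Under these assumptions the contaminated scenario passes about 60% of "seen" problems and about 35% of "unseen" ones, against about 50% for the clean scenario. Note that the test set starts on 2024-07-01 and the model date is 2024-08-01, so only the first month of test problems falls on the "seen" side in this demo.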

## 4. Advanced Temporal Validation Strategies

### Beyond Simple Date Cutoffs

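The paper uses a single fixed cutoff date. The cell below sketches two extensions, rolling-window evaluation and a difficulty-aware cutoff search, and compares them with the fixed cutoff: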

In [None]:
class AdvancedTemporalValidation:
    """Advanced strategies for temporal validation"""
    
    @staticmethod
    def rolling_window_evaluation(df: pd.DataFrame, window_months: int = 6) -> pd.DataFrame:
        """Implement rolling window evaluation strategy"""
        df = df.sort_values('release_date')
        results = []
        
        # Create rolling windows
        for i in range(window_months, len(df.groupby(df['release_date'].dt.to_period('M')))):
            # Define window
            end_date = df['release_date'].min() + pd.DateOffset(months=i)
            start_date = end_date - pd.DateOffset(months=window_months)
            
            # Split data
            train_mask = (df['release_date'] >= start_date) & (df['release_date'] < end_date)
            test_mask = (df['release_date'] >= end_date) & \
                       (df['release_date'] < end_date + pd.DateOffset(months=1))
            
            if test_mask.sum() > 0:
                results.append({
                    'window_end': end_date,
                    'train_size': train_mask.sum(),
                    'test_size': test_mask.sum(),
                    'train_period': f"{start_date.date()} to {end_date.date()}",
                    'test_period': f"{end_date.date()} to {(end_date + pd.DateOffset(months=1)).date()}"
                })
        
        return pd.DataFrame(results)
    
    @staticmethod
    def stratified_temporal_split(df: pd.DataFrame, test_ratio: float = 0.2) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Pick a cutoff date whose test set is close to test_ratio in size and
        whose difficulty mix is closest to that of the training set"""
        # Sort by date
        df = df.sort_values('release_date')
        
        # Calculate split point to get desired test ratio
        n_test = int(len(df) * test_ratio)
        
        # Find date that gives us closest to desired ratio while maintaining difficulty balance
        best_date = None
        best_score = float('inf')
        
        for date in df['release_date'].unique()[-n_test*2:]:
            test_mask = df['release_date'] >= date
            test_df = df[test_mask]
            train_df = df[~test_mask]
            
            if len(test_df) < n_test * 0.8 or len(test_df) > n_test * 1.2:
                continue
            
            # Calculate difficulty distribution difference
            train_dist = train_df['difficulty'].value_counts(normalize=True)
            test_dist = test_df['difficulty'].value_counts(normalize=True)
            
            # KL divergence as distribution difference metric
            kl_div = sum(test_dist.get(d, 0.001) * np.log(test_dist.get(d, 0.001) / train_dist.get(d, 0.001)) 
                        for d in ['Easy', 'Medium', 'Hard'])
            
            score = abs(len(test_df) - n_test) + kl_div * 100
            
            if score < best_score:
                best_score = score
                best_date = date
        
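        # Fall back to a purely size-based cutoff if no candidate date met the size constraints
        if best_date is None:
            best_date = df['release_date'].iloc[-max(n_test, 1)]
        
        # Create final split
        test_mask = df['release_date'] >= best_date
        return df[~test_mask], df[test_mask]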
    
    @staticmethod
    def visualize_validation_strategies(df: pd.DataFrame):
        """Compare different validation strategies"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Strategy 1: Fixed cutoff (paper's approach)
        cutoff = pd.to_datetime('2024-07-01')
        train1 = df[df['release_date'] < cutoff]
        test1 = df[df['release_date'] >= cutoff]
        
        ax = axes[0, 0]
        ax.hist([train1['release_date'], test1['release_date']], 
               label=['Train', 'Test'], alpha=0.7, bins=20)
        ax.axvline(x=cutoff, color='red', linestyle='--', label='Cutoff')
        ax.set_title('Strategy 1: Fixed Cutoff Date')
        ax.legend()
        
        # Strategy 2: Stratified temporal split
        train2, test2 = AdvancedTemporalValidation.stratified_temporal_split(df)
        
        ax = axes[0, 1]
        width = 0.35
        difficulties = ['Easy', 'Medium', 'Hard']
        train_counts = [len(train2[train2['difficulty'] == d]) for d in difficulties]
        test_counts = [len(test2[test2['difficulty'] == d]) for d in difficulties]
        
        x = np.arange(len(difficulties))
        ax.bar(x - width/2, train_counts, width, label='Train', alpha=0.7)
        ax.bar(x + width/2, test_counts, width, label='Test', alpha=0.7)
        ax.set_xticks(x)
        ax.set_xticklabels(difficulties)
        ax.set_title('Strategy 2: Stratified Split (Difficulty Balance)')
        ax.legend()
        
        # Strategy 3: Rolling window
        rolling_results = AdvancedTemporalValidation.rolling_window_evaluation(df)
        
        ax = axes[1, 0]
        ax.plot(rolling_results.index, rolling_results['train_size'], label='Train Size')
        ax.plot(rolling_results.index, rolling_results['test_size'], label='Test Size')
        ax.set_title('Strategy 3: Rolling Window Evaluation')
        ax.set_xlabel('Window Index')
        ax.set_ylabel('Dataset Size')
        ax.legend()
        
        # Strategy comparison
        ax = axes[1, 1]
        strategies = ['Fixed Cutoff', 'Stratified', 'Rolling Window']
        train_sizes = [len(train1), len(train2), rolling_results['train_size'].mean()]
        test_sizes = [len(test1), len(test2), rolling_results['test_size'].mean()]
        
        x = np.arange(len(strategies))
        ax.bar(x - width/2, train_sizes, width, label='Avg Train Size')
        ax.bar(x + width/2, test_sizes, width, label='Avg Test Size')
        ax.set_xticks(x)
        ax.set_xticklabels(strategies)
        ax.set_title('Strategy Comparison')
        ax.legend()
        
        plt.tight_layout()
        plt.show()

# Apply advanced strategies
all_df = pd.concat([train_df, test_df])
AdvancedTemporalValidation.visualize_validation_strategies(all_df)

## 5. Practical Implementation: Building Your Own Temporal Benchmark

Let's implement a complete framework for creating contamination-free benchmarks:

In [None]:
class TemporalBenchmarkFramework:
    """Complete framework for temporal benchmark creation and evaluation"""
    
    def __init__(self, name: str, update_frequency: str = 'monthly'):
        self.name = name
        self.update_frequency = update_frequency
        self.problems = []
        self.evaluations = []
        
    def add_problem(self, problem: Dict):
        """Add a problem with required temporal metadata"""
        required_fields = ['id', 'release_date', 'content', 'solution', 'test_cases']
        if not all(field in problem for field in required_fields):
            raise ValueError(f"Problem must contain: {required_fields}")
        
        problem['release_date'] = pd.to_datetime(problem['release_date'])
        self.problems.append(problem)
    
    def create_evaluation_snapshot(self, snapshot_date: str, 
                                 lookback_months: int = 6) -> Dict:
        """Create an evaluation snapshot at a specific date"""
        snapshot_date = pd.to_datetime(snapshot_date)
        cutoff_date = snapshot_date - pd.DateOffset(months=lookback_months)
        
        # Filter problems
        df = pd.DataFrame(self.problems)
        train_problems = df[df['release_date'] < cutoff_date]
        test_problems = df[(df['release_date'] >= cutoff_date) & 
                          (df['release_date'] <= snapshot_date)]
        
        snapshot = {
            'snapshot_date': snapshot_date,
            'cutoff_date': cutoff_date,
            'train_size': len(train_problems),
            'test_size': len(test_problems),
            'train_ids': train_problems['id'].tolist(),
            'test_ids': test_problems['id'].tolist()
        }
        
        return snapshot
    
    def evaluate_model(self, model_name: str, model_outputs: Dict[str, str], 
                      snapshot: Dict) -> Dict:
        """Evaluate model on a specific snapshot"""
        results = {
            'model': model_name,
            'snapshot_date': snapshot['snapshot_date'],
            'test_problems': len(snapshot['test_ids']),
            'passed': 0,
            'results_by_problem': []
        }
        
        # Evaluate each test problem
        for problem_id in snapshot['test_ids']:
            if problem_id in model_outputs:
                # In real implementation, execute and validate
                # For demo, simulate pass/fail
                passed = bool(np.random.random() > 0.5)
                results['passed'] += int(passed)
                results['results_by_problem'].append({
                    'problem_id': problem_id,
                    'passed': passed
                })
        
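        # Test problems missing from model_outputs stay in the denominator, so they count as failures
        n_test = results['test_problems']
        results['pass_rate'] = results['passed'] / n_test * 100 if n_test else 0.0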
        self.evaluations.append(results)
        
        return results
    
    def analyze_benchmark_health(self):
        """Analyze the health and validity of the benchmark"""
        df = pd.DataFrame(self.problems)
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # 1. Problem accumulation over time
        ax = axes[0, 0]
        df_sorted = df.sort_values('release_date')
        ax.plot(df_sorted['release_date'], range(1, len(df_sorted) + 1))
        ax.set_xlabel('Date')
        ax.set_ylabel('Cumulative Problems')
        ax.set_title('Problem Accumulation Over Time')
        
        # 2. Monthly release rate
        ax = axes[0, 1]
        monthly_counts = df.groupby(df['release_date'].dt.to_period('M')).size()
        ax.bar(monthly_counts.index.astype(str), monthly_counts.values)
        ax.set_xlabel('Month')
        ax.set_ylabel('Problems Released')
        ax.set_title('Monthly Release Rate')
        ax.tick_params(axis='x', rotation=45)
        
        # 3. Temporal coverage gaps
        ax = axes[1, 0]
        dates = df['release_date'].sort_values()
        gaps = [(dates.iloc[i+1] - dates.iloc[i]).days for i in range(len(dates)-1)]
        ax.hist(gaps, bins=30, alpha=0.7)
        ax.axvline(x=np.mean(gaps), color='red', linestyle='--', 
                  label=f'Mean: {np.mean(gaps):.1f} days')
        ax.set_xlabel('Gap Between Problems (days)')
        ax.set_ylabel('Frequency')
        ax.set_title('Temporal Coverage Analysis')
        ax.legend()
        
        # 4. Benchmark statistics
        ax = axes[1, 1]
        ax.axis('off')
        
        stats_text = f"""
        Benchmark Statistics:
        
        Total Problems: {len(df)}
        Date Range: {df['release_date'].min().date()} to {df['release_date'].max().date()}
        Avg Monthly Release: {len(df) / ((df['release_date'].max() - df['release_date'].min()).days / 30):.1f}
        
        Recommended Usage:
        - Use 6-month lookback for evaluation
        - Update monthly with new problems
        - Monitor for temporal gaps > 30 days
        """
        ax.text(0.1, 0.5, stats_text, transform=ax.transAxes, 
               fontsize=12, verticalalignment='center')
        
        plt.tight_layout()
        plt.show()

# Create and demonstrate the framework
benchmark = TemporalBenchmarkFramework("CodeBenchmark-2025")

# Add mock problems
for i, row in all_df.iterrows():
    benchmark.add_problem({
        'id': row['problem_id'],
        'release_date': row['release_date'],
        'content': row['content'],
        'solution': f"Solution for {row['problem_id']}",
        'test_cases': [{'input': 'test', 'output': 'result'}]
    })

# Create evaluation snapshot
snapshot = benchmark.create_evaluation_snapshot('2025-01-01', lookback_months=6)
print(f"Snapshot created: {snapshot['train_size']} train, {snapshot['test_size']} test problems")

# Analyze benchmark health
benchmark.analyze_benchmark_health()

## 6. Key Takeaways and Best Practices

### Critical Insights from the Paper:

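1. **Temporal Splits are Essential**: Random train/test splits are unreliable for LLMs because test problems published before the training cutoff may already be in the training data

2. **Detection is Possible**: A pass rate that drops on problems released after a model's training cutoff is a signal of contamination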

3. **Regular Updates Matter**: The paper emphasizes "live" benchmarks that continuously add new problems

### Best Practices for Temporal Benchmarks:

1. **Clear Cutoff Dates**: Document and enforce strict temporal boundaries
2. **Sufficient Test Data**: Ensure enough post-cutoff problems for reliable evaluation
3. **Monitor Performance Trends**: Regular analysis can detect contamination
4. **Version Control**: Track which problems were available at each evaluation date
5. **Transparency**: Always report model training dates alongside evaluation dates
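
The version-control and transparency points above can be sketched as a small manifest that pins the exact test set used in one evaluation. This is only an illustrative sketch: `build_snapshot_manifest` and its field names are hypothetical, not part of the paper or of any library.

```python
import hashlib
import json

def build_snapshot_manifest(benchmark_name, cutoff_date, snapshot_date, test_ids):
    """Return a JSON-serializable record that pins the test set of one evaluation."""
    ids = sorted(test_ids)  # sort so the hash does not depend on input order
    digest = hashlib.sha256("\n".join(ids).encode("utf-8")).hexdigest()
    return {
        "benchmark": benchmark_name,
        "cutoff_date": cutoff_date,
        "snapshot_date": snapshot_date,
        "n_test": len(ids),
        "test_ids_sha256": digest,
    }

# Hypothetical problem IDs, for illustration only
manifest = build_snapshot_manifest(
    "CodeBenchmark-2025", "2024-07-01", "2025-01-01", ["LC1002", "LC1001"]
)
same_set = build_snapshot_manifest(
    "CodeBenchmark-2025", "2024-07-01", "2025-01-01", ["LC1001", "LC1002"]
)
print(json.dumps(manifest, indent=2))
```

Storing a record like this next to each reported score lets a reader confirm that two results were computed on the same set of problems.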

### Future Research Directions:

1. **Dynamic Cutoffs**: Adjust based on model release dates
2. **Contamination Scoring**: Quantify the degree of potential contamination
3. **Cross-Domain Validation**: Apply temporal splits to other domains (math, science)
4. **Adversarial Testing**: Create problems specifically designed to detect memorization
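
As one possible reading of the "Dynamic Cutoffs" idea, the sketch below gives each model its own test set: the problems released strictly after that model's training cutoff. The function name, the model names, and the cutoff dates are all hypothetical.

```python
import pandas as pd

def per_model_test_sets(problems, model_cutoffs):
    """Map each model name to the problem IDs released strictly after its training cutoff."""
    release = pd.to_datetime(problems["release_date"])
    return {
        model: problems.loc[release > pd.to_datetime(cutoff), "problem_id"].tolist()
        for model, cutoff in model_cutoffs.items()
    }

# Hypothetical problems and model cutoffs, for illustration only
problems = pd.DataFrame({
    "problem_id": ["P1", "P2", "P3", "P4"],
    "release_date": ["2024-05-01", "2024-07-15", "2024-10-01", "2025-01-10"],
})
model_cutoffs = {"model_a": "2024-06-30", "model_b": "2024-12-31"}

test_sets = per_model_test_sets(problems, model_cutoffs)
print(test_sets)
```

The trade-off is that models with later cutoffs are scored on fewer problems, so their pass rates carry more sampling noise and are not directly comparable across models.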

In [None]:
# Final implementation: Contamination score calculator
def calculate_contamination_score(performance_data: pd.DataFrame, 
                                model_release_date: str) -> float:
    """Calculate a contamination score based on temporal performance patterns"""
    
    model_date = pd.to_datetime(model_release_date)
    
    # Split into before/after model release
    before = performance_data[performance_data['release_date'] < model_date]
    after = performance_data[performance_data['release_date'] >= model_date]
    
    if len(before) == 0 or len(after) == 0:
        return 0.0
    
    # Calculate performance difference
    before_rate = before['passed'].mean()
    after_rate = after['passed'].mean()
    
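    # Calculate temporal correlation (undefined when either series is constant)
    days_since_release = (performance_data['release_date'] - model_date).dt.days
    passed = performance_data['passed'].astype(float)
    if days_since_release.nunique() < 2 or passed.nunique() < 2:
        correlation = 0.0
    else:
        correlation = np.corrcoef(days_since_release, passed)[0, 1]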
    
    # Contamination score (0-1, higher = more likely contaminated)
    performance_drop = max(0, before_rate - after_rate)
    correlation_factor = max(0, -correlation)  # Negative correlation indicates contamination
    
    contamination_score = (performance_drop + correlation_factor) / 2
    
    return min(1.0, contamination_score)

# Example usage
score = calculate_contamination_score(clean_results, "2024-08-01")
print(f"Clean model contamination score: {score:.3f}")

score = calculate_contamination_score(contaminated_results, "2024-08-01")
print(f"Contaminated model contamination score: {score:.3f}")