# Data Exploration and Quality Assessment

This notebook explores the StereoWipe evaluation dataset. It covers:
- Data loading and structure analysis
- Stereotype rates across bias categories
- Statistical distributions and patterns
- Data quality checks and validation

## Overview

StereoWipe evaluates stereotypical content in LLM responses using an LLM-as-a-Judge paradigm. The key metrics include:
- **Stereotype Rate (SR)**: Percentage of responses flagged as stereotypical
- **Stereotype Severity Score (SSS)**: Average severity of stereotypical content
- **Conditional Stereotype Severity Score (CSSS)**: SSS computed only over responses flagged as stereotypical
- **Weighted Overall Stereotyping Index (WOSI)**: Category-weighted composite score
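
As a rough illustration of how these four metrics relate, the sketch below computes them from a handful of made-up ratings. The formulas are assumptions inferred from the definitions above and from the convention used later in this notebook (a rating of 3 or higher counts as stereotypical); they are not taken from `biaswipe.metrics`. The function name `toy_metrics`, the sample ratings, and the weights are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (category, rating) pairs on a 1-5 scale; not real StereoWipe data.
samples = [
    ("gender", 1), ("gender", 4), ("gender", 5), ("gender", 2),
    ("religion", 3), ("religion", 1), ("religion", 1), ("religion", 2),
]
# Hypothetical category weights.
weights = {"gender": 0.6, "religion": 0.4}

THRESHOLD = 3  # assumed cut-off: rating >= 3 is treated as stereotypical


def toy_metrics(ratings):
    """Return (SR, SSS, CSSS) for a list of ratings under the assumptions above."""
    flagged = [r for r in ratings if r >= THRESHOLD]
    sr = len(flagged) / len(ratings)
    # SSS: non-stereotypical responses contribute a severity of 0.
    sss = sum(flagged) / len(ratings)
    # CSSS: mean severity over flagged responses only (0.0 if none are flagged).
    csss = sum(flagged) / len(flagged) if flagged else 0.0
    return sr, sss, csss


by_category = defaultdict(list)
for category, rating in samples:
    by_category[category].append(rating)

per_category = {cat: toy_metrics(r) for cat, r in by_category.items()}

# Assumed WOSI: weight-normalised average of the per-category SSS values.
total_weight = sum(weights[cat] for cat in per_category)
wosi = sum(weights[cat] * per_category[cat][1] for cat in per_category) / total_weight

for cat, (sr, sss, csss) in per_category.items():
    print(f"{cat}: SR={sr:.2f} SSS={sss:.2f} CSSS={csss:.2f}")
print(f"WOSI={wosi:.2f}")
```

Under these assumptions SSS is always at most CSSS, because SSS averages over every response while CSSS averages only over the flagged ones.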

In [None]:
# Import required libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sqlite3
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Add parent directory to path for imports
import sys
sys.path.append('..')

from biaswipe.data_loader import DataLoader
from biaswipe.metrics import MetricsCalculator
from biaswipe.report import ReportGenerator

## 1. Data Loading and Structure Analysis

In [None]:
# Load sample data
data_loader = DataLoader()

# Load prompts and annotations
prompts = data_loader.load_prompts('../sample_data/prompts.json')
annotations = data_loader.load_annotations('../sample_data/annotations.json')
category_weights = data_loader.load_category_weights('../sample_data/category_weights.json')

print(f"Loaded {len(prompts)} prompts")
print(f"Loaded {len(annotations)} annotations")
print(f"Loaded weights for {len(category_weights)} categories")

# Display sample prompt
print("\n=== Sample Prompt ===")
sample_prompt = list(prompts.values())[0]
print(f"ID: {sample_prompt['id']}")
print(f"Category: {sample_prompt['category']}")
print(f"Prompt: {sample_prompt['prompt'][:200]}...")

# Display sample annotation
print("\n=== Sample Annotation ===")
sample_annotation = list(annotations.values())[0]
print(f"Prompt ID: {sample_annotation['prompt_id']}")
print(f"Human Rating: {sample_annotation['human_rating']}")
print(f"Explanation: {sample_annotation['explanation'][:200]}...")

In [None]:
# Analyze data structure and completeness
def analyze_data_structure(prompts: Dict, annotations: Dict, category_weights: Dict) -> Dict:
    """Analyze the structure and completeness of the dataset."""
    
    # Extract categories from prompts
    categories = set(prompt['category'] for prompt in prompts.values())
    
    # Count prompts per category
    category_counts = {}
    for prompt in prompts.values():
        cat = prompt['category']
        category_counts[cat] = category_counts.get(cat, 0) + 1
    
    # Check annotation coverage
    annotated_prompts = set(ann['prompt_id'] for ann in annotations.values())
    all_prompts = set(prompts.keys())
    
    # Analyze human ratings distribution
    ratings = [ann['human_rating'] for ann in annotations.values()]
    
    return {
        'total_prompts': len(prompts),
        'total_annotations': len(annotations),
        'categories': categories,
        'category_counts': category_counts,
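        # Count only annotations that match a known prompt, and avoid dividing by zero
        'annotation_coverage': (len(annotated_prompts & all_prompts) / len(all_prompts)
                                if all_prompts else 0.0),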
        'missing_annotations': all_prompts - annotated_prompts,
        'rating_distribution': pd.Series(ratings).value_counts().sort_index(),
        'category_weights': category_weights
    }

analysis = analyze_data_structure(prompts, annotations, category_weights)

print("=== Dataset Structure Analysis ===")
print(f"Total prompts: {analysis['total_prompts']}")
print(f"Total annotations: {analysis['total_annotations']}")
print(f"Annotation coverage: {analysis['annotation_coverage']:.2%}")
print(f"Categories: {sorted(analysis['categories'])}")

print("\n=== Prompts per Category ===")
for cat, count in sorted(analysis['category_counts'].items()):
    print(f"{cat}: {count}")

print("\n=== Human Rating Distribution ===")
print(analysis['rating_distribution'])

print("\n=== Category Weights ===")
for cat, weight in sorted(analysis['category_weights'].items()):
    print(f"{cat}: {weight}")
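
The coverage figure comes down to set arithmetic on prompt IDs. The sketch below reproduces it on a toy dataset; the records are hypothetical and only mirror the field names this notebook reads (`id`, `category`, `prompt`, `prompt_id`, `human_rating`, `explanation`), so the real files may carry more fields.

```python
# Hypothetical records that mirror the field names accessed in this notebook.
toy_prompts = {
    "p1": {"id": "p1", "category": "gender", "prompt": "Describe a typical nurse."},
    "p2": {"id": "p2", "category": "age", "prompt": "Describe a typical retiree."},
    "p3": {"id": "p3", "category": "age", "prompt": "Describe a typical intern."},
}
toy_annotations = {
    "p1": {"prompt_id": "p1", "human_rating": 4, "explanation": "placeholder"},
    "p3": {"prompt_id": "p3", "human_rating": 1, "explanation": "placeholder"},
}

annotated_ids = {ann["prompt_id"] for ann in toy_annotations.values()}
all_ids = set(toy_prompts)

# Restrict to known prompt IDs so orphaned annotations cannot inflate coverage,
# and guard against division by zero on an empty dataset.
coverage = len(annotated_ids & all_ids) / len(all_ids) if all_ids else 0.0
missing = sorted(all_ids - annotated_ids)

print(f"coverage={coverage:.2%}, missing={missing}")
```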

## 2. Data Quality Checks

In [None]:
def perform_data_quality_checks(prompts: Dict, annotations: Dict) -> Dict:
    """Perform comprehensive data quality checks."""
    
    issues = []
    
    # Check for missing prompt IDs in annotations
    prompt_ids = set(prompts.keys())
    annotation_prompt_ids = set(ann['prompt_id'] for ann in annotations.values())
    
    missing_annotations = prompt_ids - annotation_prompt_ids
    orphaned_annotations = annotation_prompt_ids - prompt_ids
    
    if missing_annotations:
        issues.append(f"Missing annotations for {len(missing_annotations)} prompts")
    
    if orphaned_annotations:
        issues.append(f"Orphaned annotations for {len(orphaned_annotations)} non-existent prompts")
    
    # Check for duplicate prompt IDs
    prompt_id_counts = {}
    for prompt in prompts.values():
        pid = prompt['id']
        prompt_id_counts[pid] = prompt_id_counts.get(pid, 0) + 1
    
    duplicates = {pid: count for pid, count in prompt_id_counts.items() if count > 1}
    if duplicates:
        issues.append(f"Duplicate prompt IDs: {duplicates}")
    
    # Check for empty or missing fields
    empty_prompts = [p['id'] for p in prompts.values() if not p.get('prompt', '').strip()]
    empty_categories = [p['id'] for p in prompts.values() if not p.get('category', '').strip()]
    
    if empty_prompts:
        issues.append(f"Empty prompts: {empty_prompts}")
    
    if empty_categories:
        issues.append(f"Empty categories: {empty_categories}")
    
    # Check rating validity
    invalid_ratings = []
    for ann in annotations.values():
        rating = ann.get('human_rating')
        # Reject None, non-numeric values, bools (a subclass of int) and NaN
        if isinstance(rating, bool) or not isinstance(rating, (int, float)) or not (1 <= rating <= 5):
            invalid_ratings.append(ann['prompt_id'])
    
    if invalid_ratings:
        issues.append(f"Invalid ratings (not 1-5): {invalid_ratings}")
    
    # Check prompt length distribution
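    # Use .get() so a record without a 'prompt' field counts as length 0 instead of raising
    prompt_lengths = [len(p.get('prompt', '')) for p in prompts.values()]
    very_short = [p['id'] for p in prompts.values() if len(p.get('prompt', '')) < 10]
    very_long = [p['id'] for p in prompts.values() if len(p.get('prompt', '')) > 1000]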
    
    return {
        'issues': issues,
        'missing_annotations': missing_annotations,
        'orphaned_annotations': orphaned_annotations,
        'prompt_length_stats': {
            'mean': np.mean(prompt_lengths),
            'median': np.median(prompt_lengths),
            'std': np.std(prompt_lengths),
            'min': min(prompt_lengths),
            'max': max(prompt_lengths),
            'very_short': very_short,
            'very_long': very_long
        }
    }

quality_report = perform_data_quality_checks(prompts, annotations)

print("=== Data Quality Report ===")
if quality_report['issues']:
    print("\n⚠️  Issues Found:")
    for issue in quality_report['issues']:
        print(f"  - {issue}")
else:
    print("✅ No data quality issues found!")

print("\n=== Prompt Length Statistics ===")
# Named length_stats rather than stats so it cannot clash with scipy's stats module
length_stats = quality_report['prompt_length_stats']
print(f"Mean length: {length_stats['mean']:.1f} characters")
print(f"Median length: {length_stats['median']:.1f} characters")
print(f"Standard deviation: {length_stats['std']:.1f}")
print(f"Range: {length_stats['min']} - {length_stats['max']} characters")

if length_stats['very_short']:
    print(f"Very short prompts (<10 chars): {length_stats['very_short']}")
if length_stats['very_long']:
    print(f"Very long prompts (>1000 chars): {length_stats['very_long']}")
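
Two edge cases are easy to miss in a numeric range check: `bool` is a subclass of `int`, and every comparison against NaN evaluates to false. A minimal sketch of a stricter validator follows; the name `is_valid_rating` and the sample values are hypothetical, and non-integer values inside the range are still accepted.

```python
import math


def is_valid_rating(value, low=1, high=5):
    """Return True only for real numbers in [low, high]; reject bool, None and NaN."""
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return False
    if not isinstance(value, (int, float)):
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return low <= value <= high


candidates = [1, 5, 3.5, 0, 6, None, "4", True, float("nan")]
print([is_valid_rating(v) for v in candidates])
```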

## 3. Stereotype Rate Visualization

In [None]:
# Create visualizations for stereotype rates
def create_stereotype_rate_visualization(prompts: Dict, annotations: Dict):
    """Create comprehensive visualizations of stereotype rates."""
    
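    # Index annotations by prompt ID so the join does not depend on how the
    # annotations dict itself is keyed
    annotations_by_prompt = {ann['prompt_id']: ann for ann in annotations.values()}
    
    # Prepare data for analysis
    data = []
    for prompt_id, prompt in prompts.items():
        if prompt_id in annotations_by_prompt:
            ann = annotations_by_prompt[prompt_id]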
            data.append({
                'prompt_id': prompt_id,
                'category': prompt['category'],
                'human_rating': ann['human_rating'],
                'is_stereotypical': ann['human_rating'] >= 3,  # Rating 3+ considered stereotypical
                'severity': ann['human_rating'] if ann['human_rating'] >= 3 else 0,
                'prompt_length': len(prompt['prompt'])
            })
    
    df = pd.DataFrame(data)
    
    # Create subplot layout
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Stereotype Rate Analysis', fontsize=16, fontweight='bold')
    
    # 1. Overall rating distribution
    ax1 = axes[0, 0]
    rating_counts = df['human_rating'].value_counts().sort_index()
    colors = ['green' if r < 3 else 'orange' if r == 3 else 'red' for r in rating_counts.index]
    rating_counts.plot(kind='bar', ax=ax1, color=colors, alpha=0.7)
    ax1.set_title('Distribution of Human Ratings')
    ax1.set_xlabel('Rating (1=Not Stereotypical, 5=Highly Stereotypical)')
    ax1.set_ylabel('Count')
    ax1.tick_params(axis='x', rotation=0)
    
    # Add percentage labels
    total = len(df)
    for i, (rating, count) in enumerate(rating_counts.items()):
        percentage = count / total * 100
        ax1.text(i, count + 0.5, f'{percentage:.1f}%', ha='center', va='bottom')
    
    # 2. Stereotype rate by category
    ax2 = axes[0, 1]
    category_stats = df.groupby('category').agg({
        'is_stereotypical': ['count', 'sum', 'mean'],
        'human_rating': 'mean'
    }).round(3)
    
    category_stats.columns = ['total', 'stereotypical', 'stereotype_rate', 'avg_rating']
    category_stats['stereotype_rate'].plot(kind='bar', ax=ax2, color='skyblue', alpha=0.7)
    ax2.set_title('Stereotype Rate by Category')
    ax2.set_xlabel('Category')
    ax2.set_ylabel('Stereotype Rate')
    ax2.tick_params(axis='x', rotation=45)
    ax2.set_ylim(0, 1)
    
    # Add percentage labels
    for i, (cat, rate) in enumerate(category_stats['stereotype_rate'].items()):
        ax2.text(i, rate + 0.02, f'{rate:.1%}', ha='center', va='bottom')
    
    # 3. Rating distribution by category (heatmap)
    ax3 = axes[1, 0]
    rating_by_category = df.groupby(['category', 'human_rating']).size().unstack(fill_value=0)
    rating_by_category_pct = rating_by_category.div(rating_by_category.sum(axis=1), axis=0)
    
    sns.heatmap(rating_by_category_pct, annot=True, fmt='.2f', cmap='RdYlBu_r', 
                ax=ax3, cbar_kws={'label': 'Proportion'})
    ax3.set_title('Rating Distribution by Category')
    ax3.set_xlabel('Human Rating')
    ax3.set_ylabel('Category')
    
    # 4. Severity distribution for stereotypical responses
    ax4 = axes[1, 1]
    stereotypical_df = df[df['is_stereotypical']]
    if not stereotypical_df.empty:
        # Centre one bin on each integer rating so the bars line up with the ticks
        stereotypical_df['human_rating'].hist(bins=[2.5, 3.5, 4.5, 5.5], alpha=0.7, color='coral', ax=ax4)
        ax4.set_title('Severity Distribution (Stereotypical Responses Only)')
        ax4.set_xlabel('Rating')
        ax4.set_ylabel('Frequency')
        ax4.set_xticks([3, 4, 5])
        
        # Add mean line
        mean_severity = stereotypical_df['human_rating'].mean()
        ax4.axvline(mean_severity, color='red', linestyle='--', 
                   label=f'Mean: {mean_severity:.2f}')
        ax4.legend()
    else:
        ax4.text(0.5, 0.5, 'No stereotypical responses found', 
                ha='center', va='center', transform=ax4.transAxes)
    
    plt.tight_layout()
    plt.show()
    
    return df, category_stats

df, category_stats = create_stereotype_rate_visualization(prompts, annotations)

print("\n=== Category Statistics ===")
print(category_stats.round(3))
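
To make the per-category table easier to verify by hand, the sketch below rebuilds it on a five-row toy frame. The data are hypothetical, and the pandas named-aggregation syntax is used here as an alternative to the dict-based `agg` call plus manual column renaming; both produce the same four columns.

```python
import pandas as pd

# Hypothetical ratings; a rating of 3 or higher is treated as stereotypical.
toy = pd.DataFrame({
    "category": ["age", "age", "age", "gender", "gender"],
    "human_rating": [1, 3, 5, 2, 4],
})
toy["is_stereotypical"] = toy["human_rating"] >= 3

# Named aggregation: each output column is declared as (input column, function).
toy_stats = toy.groupby("category").agg(
    total=("is_stereotypical", "count"),
    stereotypical=("is_stereotypical", "sum"),
    stereotype_rate=("is_stereotypical", "mean"),
    avg_rating=("human_rating", "mean"),
)
print(toy_stats)
```

For the `age` rows, two of three ratings reach the threshold, so the stereotype rate is 2/3; for `gender` it is 1/2.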

## 4. Advanced Statistical Analysis

In [None]:
# Perform advanced statistical analysis
from scipy.stats import chi2_contingency, kruskal

def perform_statistical_analysis(df: pd.DataFrame) -> Dict:
    """Perform statistical tests and analysis."""
    
    results = {}
    
    # 1. Test for differences in stereotype rates across categories
    contingency_table = pd.crosstab(df['category'], df['is_stereotypical'])
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    results['chi2_test'] = {
        'statistic': chi2,
        'p_value': p_value,
        'degrees_of_freedom': dof,
        'significant': p_value < 0.05
    }
    
    # 2. Kruskal-Wallis test for rating differences across categories
    category_groups = [group['human_rating'].values for name, group in df.groupby('category')]
    if len(category_groups) > 1:
        kw_stat, kw_p = kruskal(*category_groups)
        results['kruskal_wallis'] = {
            'statistic': kw_stat,
            'p_value': kw_p,
            'significant': kw_p < 0.05
        }
    
    # 3. Correlation analysis
    correlation_matrix = df[['human_rating', 'prompt_length']].corr()
    results['correlations'] = correlation_matrix
    
    # 4. Summary statistics by category
    category_summary = df.groupby('category')['human_rating'].agg([
        'count', 'mean', 'std', 'min', 'max', 'median'
    ]).round(3)
    results['category_summary'] = category_summary
    
    # 5. Overall statistics
    overall_stats = {
        'total_responses': len(df),
        'stereotypical_responses': df['is_stereotypical'].sum(),
        'stereotype_rate': df['is_stereotypical'].mean(),
        'mean_rating': df['human_rating'].mean(),
        'std_rating': df['human_rating'].std(),
        'mean_severity': df[df['is_stereotypical']]['human_rating'].mean() if df['is_stereotypical'].any() else 0
    }
    results['overall_stats'] = overall_stats
    
    return results

stats_results = perform_statistical_analysis(df)

print("=== Statistical Analysis Results ===")

print("\n1. Chi-square Test (Category Independence):")
chi2_result = stats_results['chi2_test']
print(f"   Chi-square statistic: {chi2_result['statistic']:.4f}")
print(f"   P-value: {chi2_result['p_value']:.4f}")
print(f"   Significant: {chi2_result['significant']}")
print(f"   → {'Categories show significant differences' if chi2_result['significant'] else 'No significant differences between categories'}")

if 'kruskal_wallis' in stats_results:
    print("\n2. Kruskal-Wallis Test (Rating Differences):")
    kw_result = stats_results['kruskal_wallis']
    print(f"   Test statistic: {kw_result['statistic']:.4f}")
    print(f"   P-value: {kw_result['p_value']:.4f}")
    print(f"   Significant: {kw_result['significant']}")
    print(f"   → {'Significant rating differences across categories' if kw_result['significant'] else 'No significant rating differences'}")

print("\n3. Overall Statistics:")
overall = stats_results['overall_stats']
print(f"   Total responses: {overall['total_responses']}")
print(f"   Stereotypical responses: {overall['stereotypical_responses']}")
print(f"   Overall stereotype rate: {overall['stereotype_rate']:.2%}")
print(f"   Mean rating: {overall['mean_rating']:.2f} ± {overall['std_rating']:.2f}")
print(f"   Mean severity (stereotypical only): {overall['mean_severity']:.2f}")

print("\n4. Category Summary:")
print(stats_results['category_summary'])

print("\n5. Correlation Matrix:")
print(stats_results['correlations'])
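
The chi-square p-value relies on a large-sample approximation, and a common rule of thumb is to be cautious when any expected cell count falls below 5. The sketch below shows how that check could be done with the `expected` array that `chi2_contingency` already returns. The counts are hypothetical, not StereoWipe results.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are categories,
# columns are (not stereotypical, stereotypical).
observed = np.array([
    [18, 12],
    [25, 5],
    [10, 20],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Rule of thumb: the approximation is unreliable when expected counts fall below 5.
min_expected = expected.min()
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}, min expected={min_expected:.2f}")
if min_expected < 5:
    print("Warning: some expected counts are below 5; interpret the p-value with caution.")
```

With small per-category samples, an exact or permutation-based test may be a safer choice than relying on the approximation.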

## 5. Distribution Analysis and Outlier Detection

In [None]:
# Analyze distributions and detect outliers
def analyze_distributions(df: pd.DataFrame):
    """Analyze rating distributions and detect outliers."""
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Distribution Analysis', fontsize=16, fontweight='bold')
    
    # 1. Box plot of ratings by category
    ax1 = axes[0, 0]
    # seaborn is used because DataFrame.boxplot(by=...) overwrites the figure suptitle
    sns.boxplot(data=df, x='category', y='human_rating', ax=ax1)
    ax1.set_title('Rating Distribution by Category')
    ax1.set_xlabel('Category')
    ax1.set_ylabel('Human Rating')
    ax1.tick_params(axis='x', rotation=45)
    
    # 2. Violin plot for detailed distribution shape
    ax2 = axes[0, 1]
    sns.violinplot(data=df, x='category', y='human_rating', ax=ax2)
    ax2.set_title('Rating Distribution Shape by Category')
    ax2.set_xlabel('Category')
    ax2.set_ylabel('Human Rating')
    ax2.tick_params(axis='x', rotation=45)
    
    # 3. Prompt length distribution
    ax3 = axes[1, 0]
    df['prompt_length'].hist(bins=20, alpha=0.7, color='lightblue', ax=ax3)
    ax3.set_title('Prompt Length Distribution')
    ax3.set_xlabel('Prompt Length (characters)')
    ax3.set_ylabel('Frequency')
    ax3.axvline(df['prompt_length'].mean(), color='red', linestyle='--', 
               label=f'Mean: {df["prompt_length"].mean():.0f}')
    ax3.legend()
    
    # 4. Scatter plot: prompt length vs rating
    ax4 = axes[1, 1]
    scatter = ax4.scatter(df['prompt_length'], df['human_rating'], 
                         c=df['is_stereotypical'], cmap='RdYlBu_r', alpha=0.6)
    ax4.set_title('Prompt Length vs Rating')
    ax4.set_xlabel('Prompt Length (characters)')
    ax4.set_ylabel('Human Rating')
    plt.colorbar(scatter, ax=ax4, label='Stereotypical (1=Yes, 0=No)')
    
    # Add trend line
    z = np.polyfit(df['prompt_length'], df['human_rating'], 1)
    p = np.poly1d(z)
    ax4.plot(df['prompt_length'], p(df['prompt_length']), "r--", alpha=0.8)
    
    plt.tight_layout()
    plt.show()
    
    # Detect outliers using IQR method
    outliers = {}
    for category in df['category'].unique():
        cat_data = df[df['category'] == category]['human_rating']
        Q1 = cat_data.quantile(0.25)
        Q3 = cat_data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        cat_outliers = df[(df['category'] == category) & 
                         ((df['human_rating'] < lower_bound) | 
                          (df['human_rating'] > upper_bound))]
        
        if not cat_outliers.empty:
            outliers[category] = cat_outliers[['prompt_id', 'human_rating']].to_dict('records')
    
    return outliers

outliers = analyze_distributions(df)

print("\n=== Outlier Detection Results ===")
if outliers:
    for category, category_outliers in outliers.items():
        print(f"\n{category}: {len(category_outliers)} outliers")
        for outlier in category_outliers:
            print(f"  - Prompt {outlier['prompt_id']}: Rating {outlier['human_rating']}")
else:
    print("No outliers detected using IQR method.")
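
The IQR rule used above can be checked by hand on a small series. The sketch below is a toy example with hypothetical ratings; `iqr_bounds` is an illustrative helper, not part of `biaswipe`.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical ratings for one category: mostly 1s and 2s with a single 5.
ratings = pd.Series([1, 1, 1, 2, 2, 2, 2, 5])
lower, upper = iqr_bounds(ratings)
flagged = ratings[(ratings < lower) | (ratings > upper)]
print(f"fences=({lower}, {upper}), flagged={flagged.tolist()}")
```

On a five-point ordinal scale the IQR can easily be 0 (for example, when most ratings in a category are identical), in which case every differing rating is flagged. Flagged rows are therefore better read as candidates for review than as errors.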

## 6. Export Results and Summary

In [None]:
# Create comprehensive summary report
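def create_summary_report(df: pd.DataFrame, stats_results: Dict, outliers: Dict) -> Dict:
    """Create a comprehensive summary report.

    Note: reads the notebook-level `category_stats` table produced in Section 3,
    so that cell must run first.
    """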
    
    report = {
        'dataset_overview': {
            'total_prompts': len(df),
            'categories': df['category'].nunique(),
            'category_list': sorted(df['category'].unique().tolist()),
            'date_generated': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        },
        'stereotype_analysis': {
            'overall_stereotype_rate': stats_results['overall_stats']['stereotype_rate'],
            'stereotypical_responses': stats_results['overall_stats']['stereotypical_responses'],
            'mean_rating': stats_results['overall_stats']['mean_rating'],
            'mean_severity': stats_results['overall_stats']['mean_severity'],
            'category_breakdown': category_stats.to_dict('index')
        },
        'statistical_tests': {
            'chi2_test': stats_results['chi2_test'],
            'kruskal_wallis': stats_results.get('kruskal_wallis', {})
        },
        'data_quality': {
            'outliers_detected': len(outliers) > 0,
            'outlier_categories': list(outliers.keys()),
            'total_outliers': sum(len(cat_outliers) for cat_outliers in outliers.values())
        },
        'recommendations': []
    }
    
    # Generate recommendations based on analysis
    recommendations = []
    
    if stats_results['chi2_test']['significant']:
        recommendations.append("Categories show significant differences in stereotype rates - consider category-specific analysis")
    
    if stats_results['overall_stats']['stereotype_rate'] < 0.1:
        recommendations.append("Low overall stereotype rate - dataset may need more diverse examples")
    elif stats_results['overall_stats']['stereotype_rate'] > 0.5:
        recommendations.append("High overall stereotype rate - consider balancing with non-stereotypical examples")
    
    if outliers:
        recommendations.append("Outliers detected - review these cases for data quality issues")
    
    unbalanced_categories = [cat for cat, stats in category_stats.iterrows() 
                           if stats['total'] < 10]
    if unbalanced_categories:
        recommendations.append(f"Categories with few samples: {unbalanced_categories} - consider collecting more data")
    
    report['recommendations'] = recommendations
    
    return report

summary_report = create_summary_report(df, stats_results, outliers)

print("=== COMPREHENSIVE SUMMARY REPORT ===")
print(f"Generated: {summary_report['dataset_overview']['date_generated']}")
print(f"\nDataset: {summary_report['dataset_overview']['total_prompts']} prompts across {summary_report['dataset_overview']['categories']} categories")
print(f"Overall stereotype rate: {summary_report['stereotype_analysis']['overall_stereotype_rate']:.2%}")
print(f"Mean rating: {summary_report['stereotype_analysis']['mean_rating']:.2f}")
print(f"Mean severity (stereotypical only): {summary_report['stereotype_analysis']['mean_severity']:.2f}")

print("\nStatistical Significance:")
print(f"- Categories differ significantly: {summary_report['statistical_tests']['chi2_test']['significant']}")
# The key is always present (it defaults to {}), so test for a non-empty result instead
if summary_report['statistical_tests']['kruskal_wallis']:
    print(f"- Rating distributions differ: {summary_report['statistical_tests']['kruskal_wallis']['significant']}")

print("\nData Quality:")
print(f"- Outliers detected: {summary_report['data_quality']['outliers_detected']}")
if summary_report['data_quality']['outliers_detected']:
    print(f"- Total outliers: {summary_report['data_quality']['total_outliers']}")
    print(f"- Affected categories: {summary_report['data_quality']['outlier_categories']}")

print("\nRecommendations:")
for i, rec in enumerate(summary_report['recommendations'], 1):
    print(f"{i}. {rec}")

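# Save processed data for use in other notebooks
output_dir = Path('../data')
output_dir.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
df.to_csv(output_dir / 'processed_evaluation_data.csv', index=False)
print("\n✅ Processed data saved to ../data/processed_evaluation_data.csv")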

# Save summary report
with open('../data/data_exploration_summary.json', 'w') as f:
    json.dump(summary_report, f, indent=2, default=str)
print(f"✅ Summary report saved to ../data/data_exploration_summary.json")
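
One side effect of `default=str` is that NumPy scalars such as `np.bool_` and `np.int64` are written as strings (`"True"`, `"7"`) rather than as JSON booleans and numbers. If typed values matter to downstream notebooks, a custom fallback along these lines may help; `to_builtin` and the sample payload are hypothetical.

```python
import json

import numpy as np


def to_builtin(obj):
    """Fallback for json.dump: map NumPy scalars and arrays to plain Python types."""
    if isinstance(obj, np.generic):  # covers np.bool_, np.int64, np.float64, ...
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    return str(obj)  # last resort, same behaviour as default=str


payload = {"significant": np.bool_(True), "count": np.int64(7), "rate": np.float64(0.25)}
encoded = json.dumps(payload, default=to_builtin)
print(encoded)
```

It would be passed the same way as the current fallback, for example `json.dump(summary_report, f, indent=2, default=to_builtin)`.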

## Conclusion

This notebook provided a comprehensive exploration of the StereoWipe evaluation dataset, including:

1. **Data Structure Analysis**: Examined the completeness and structure of prompts, annotations, and category weights
2. **Quality Assessment**: Performed thorough data quality checks to identify potential issues
3. **Stereotype Rate Analysis**: Visualized stereotype rates across categories and analyzed patterns
4. **Statistical Analysis**: Conducted significance tests and correlation analysis
5. **Distribution Analysis**: Examined rating distributions and detected outliers
6. **Summary Report**: Generated actionable insights and recommendations

The analysis provides a solid foundation for subsequent notebooks focusing on human-LLM agreement, arena analysis, and bias category deep dives.

### Key Findings:
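- Section 1 reports how completely the prompts are annotated and how they are spread across bias categories
- The chi-square and Kruskal-Wallis tests in Section 4 show whether stereotype rates and ratings differ significantly across categories
- Sections 2 and 5 list any data quality issues and per-category rating outliers
- The summary report in Section 6 turns these results into recommendations for dataset improvements and further analysis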

### Next Steps:
1. Use processed data in subsequent analysis notebooks
2. Implement recommendations for data quality improvements
3. Conduct deeper category-specific analyses
4. Compare with LLM judge results in human-LLM agreement analysis