# Analysis of Graded Agent Responses

This notebook loads classification results generated by `grade.py` (JSONL files) from multiple result directories and visualizes the distribution of rationale categories, off-topic responses, and performance metrics by agent model.

The notebook automatically detects and processes:
- Single-agent results from the `results/` directory
- Multi-agent results from the `results_multi/` directory
- Multi-agent star topology results from the `results_multi_star/` directory
- Any other `results_*` directories found

For each detected type it generates separate visualizations, then builds aggregate comparisons across all types.
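
As a rough illustration of the directory detection described here, the sketch below maps directory names to dataset labels in the same way the log output of this notebook does (`results` to `single_agent`, `results_ous` to `ous`, and so on). The function name `discover_result_dirs` and the `KNOWN_LABELS` table are hypothetical; the real logic lives in `analysis_functions.py` and may differ.

```python
from pathlib import Path

# Hypothetical label table, inferred from the directory names in this project;
# any other results_* directory falls back to its suffix (results_foo -> "foo").
KNOWN_LABELS = {
    "results": "single_agent",
    "results_multi": "multi_agent",
    "results_multi_star": "multi_agent_star",
}

def discover_result_dirs(base_dir):
    """Return {label: path} for `results` and every `results_*` directory."""
    found = {}
    for path in sorted(Path(base_dir).glob("results*")):
        if not path.is_dir():
            continue  # ignore stray files such as results_notes.txt
        name = path.name
        if name != "results" and not name.startswith("results_"):
            continue  # e.g. "resultsfoo" is not a results directory
        label = KNOWN_LABELS.get(name, name[len("results_"):])
        found[label] = path
    return found
```

Calling `discover_result_dirs(BASE_DIR)` would then give one entry per dataset type, including types whose directories turn out to contain no files.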

**Note**: If classification files are not found, the notebook will fall back to loading raw CSV files and provide analysis on the available data.
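
The fallback behaviour can be pictured with a minimal sketch like the one below. It assumes classification files end in `_classification.jsonl` (as the file name in the log output suggests) and that raw results are plain `.csv` files in the same directory; `load_with_fallback` is a hypothetical helper, not the actual `load_datasets_with_fallback` implementation.

```python
from pathlib import Path
import pandas as pd

def load_with_fallback(results_dir):
    """Load graded JSONL files if present, otherwise raw CSV files.

    Returns (DataFrame, has_classification).
    """
    results_dir = Path(results_dir)

    # Preferred source: one JSON object per line, written by the grader.
    jsonl_files = sorted(results_dir.glob("*_classification.jsonl"))
    if jsonl_files:
        frames = [pd.read_json(f, lines=True).assign(source_file=f.name)
                  for f in jsonl_files]
        return pd.concat(frames, ignore_index=True), True

    # Fallback: raw, ungraded CSV output (file pattern is an assumption).
    csv_files = sorted(results_dir.glob("*.csv"))
    if csv_files:
        frames = [pd.read_csv(f).assign(source_file=f.name) for f in csv_files]
        return pd.concat(frames, ignore_index=True), False

    # Nothing to load: return an empty frame so callers can test `.empty`.
    return pd.DataFrame(), False
```

Returning the `has_classification` flag alongside the data lets later cells decide whether category plots are possible or only basic statistics can be shown.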

**Updated**: The notebook now handles single-agent classification files properly and provides improved data standardization.

In [None]:
import pandas as pd
import json
import os
import logging
import numpy as np
import glob
from pathlib import Path
import sys

# Add the current directory to sys.path to import analysis functions
sys.path.append(os.path.abspath('.'))

try:
    from analysis_functions import (
        load_datasets_with_fallback,
        prepare_datasets_for_analysis,
        create_combined_dataset,
        generate_summary_stats
    )
except ImportError as e:
    print(f"Error importing analysis functions: {e}")
    print("Make sure analysis_functions.py exists in the current directory")
    raise

# --- Configuration ---
BASE_DIR = "/Users/ram/Github/wisdom_agents/"

# --- Logging Setup ---
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - NOTEBOOK - %(levelname)s - %(message)s')

# --- Load Data with Fallback ---
print("=== LOADING DATASETS WITH FALLBACK TO RAW DATA ===")
datasets, dataset_info = load_datasets_with_fallback(BASE_DIR)

# --- Prepare Data for Analysis ---
print("\n=== PREPARING DATA FOR ANALYSIS ===")
prepared_datasets = prepare_datasets_for_analysis(datasets, dataset_info)

# Add combined dataset
print("\n=== CREATING COMBINED DATASET ===")
prepared_datasets['combined'] = create_combined_dataset(prepared_datasets)

# --- Summary Statistics ---
print("\n=== DATA PREPARATION SUMMARY ===")

for data_type, data_dict in prepared_datasets.items():
    analysis_df = data_dict['analysis']
    exploded_df = data_dict['exploded']
    errors_df = data_dict['errors']
    info = data_dict['info']
    
    print(f"\n{data_type.replace('_', '-').title()}:")
    print(f"  Valid responses: {len(analysis_df)}")
    print(f"  Exploded categories: {len(exploded_df)}")
    print(f"  Processing errors: {len(errors_df)}")
    print(f"  Has classification data: {'Yes' if info['has_classification'] else 'No'}")
    print(f"  File type: {info['file_type']}")
    
    if not analysis_df.empty:
        print(f"  Unique questions: {analysis_df['question_id'].nunique() if 'question_id' in analysis_df.columns else 'N/A'}")
        
        # Check for model column in various formats
        model_col = None
        for col in ['agent_model', 'model_name', 'agent_name']:
            if col in analysis_df.columns:
                model_col = col
                break
        
        if model_col:
            print(f"  Unique models: {analysis_df[model_col].nunique()}")
        
        if not errors_df.empty and 'error_type' in errors_df.columns:
            print(f"  Error types: {errors_df['error_type'].nunique()}")

2025-05-22 16:59:09,476 - NOTEBOOK - INFO - Found results directory: results_ous_multi -> ous_multi
2025-05-22 16:59:09,480 - NOTEBOOK - INFO - Found results directory: results_multi_star -> multi_agent_star
2025-05-22 16:59:09,484 - NOTEBOOK - INFO - Found results directory: results_ous -> ous
2025-05-22 16:59:09,485 - NOTEBOOK - INFO - Found results directory: results -> single_agent
2025-05-22 16:59:09,486 - NOTEBOOK - INFO - Found results directory: results_multi -> multi_agent
2025-05-22 16:59:09,489 - NOTEBOOK - INFO - Found 0 classification files in /Users/ram/Github/wisdom_agents/results_ous_multi


=== DISCOVERED RESULT DIRECTORIES ===
ous_multi: /Users/ram/Github/wisdom_agents/results_ous_multi
multi_agent_star: /Users/ram/Github/wisdom_agents/results_multi_star
ous: /Users/ram/Github/wisdom_agents/results_ous
single_agent: /Users/ram/Github/wisdom_agents/results
multi_agent: /Users/ram/Github/wisdom_agents/results_multi

=== LOADING DATA ===
ous_multi: No files found
multi_agent_star: 565 records from 1 files
ous: No files found
single_agent: 900 records from 1 files


2025-05-22 16:59:09,962 - NOTEBOOK - INFO - Loaded 25690 records from ggb_qwen-2.5-7b-instruct_ring_ensemble_260486c5_q1-90_n12_classification.jsonl
2025-05-22 16:59:09,963 - NOTEBOOK - INFO - Combined 25690 total records for multi_agent
2025-05-22 16:59:10,009 - NOTEBOOK - INFO - Combined dataset: 27155 total records


multi_agent: 25690 records from 1 files

=== DATA LOADING SUMMARY ===
ous_multi: 0 records
multi_agent_star: 565 records
ous: 0 records
single_agent: 900 records
multi_agent: 25690 records
Total combined records: 27155

Data type distribution:
data_type
multi_agent         25690
single_agent          900
multi_agent_star      565
Name: count, dtype: int64

Sample of combined data:
          data_type                                        source_file  \
0  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
1  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
2  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
3  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
4  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
5  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
6  multi_agent_star  ggb_star_evil_supervisor_gpt-4o-mini_central_o...   
7  multi_agent_star  ggb

In [None]:
# Generate and display summary statistics table
summary_df = generate_summary_stats(prepared_datasets)

if not summary_df.empty:
    print("=== SUMMARY STATISTICS TABLE ===")
    print(summary_df.to_string(index=False))
    
    # Check what data types we have for visualization
    datasets_with_classification = summary_df[summary_df['Has Classification'] == 'Yes']
    datasets_with_raw_data = summary_df[summary_df['Has Classification'] == 'No']
    
    print("\n=== DATA AVAILABILITY SUMMARY ===")
    print(f"Datasets with classification data: {len(datasets_with_classification)}")
    if not datasets_with_classification.empty:
        print("  -", ", ".join(datasets_with_classification['Dataset'].tolist()))
    
    print(f"Datasets with raw data only: {len(datasets_with_raw_data)}")
    if not datasets_with_raw_data.empty:
        print("  -", ", ".join(datasets_with_raw_data['Dataset'].tolist()))
    
    # Set up data for visualizations
    df_all = prepared_datasets['combined']['analysis']
    df_all_exploded = prepared_datasets['combined']['exploded'] 
    df_all_errors = prepared_datasets['combined']['errors']
    
    print(f"\nCombined dataset: {len(df_all)} total records")
    print(f"Combined exploded categories: {len(df_all_exploded)} records")
    print(f"Combined errors: {len(df_all_errors)} records")
    
    # Display column information for debugging
    if not df_all.empty:
        print("\nAvailable columns in combined analysis data:")
        print(f"  Standard columns: {[col for col in df_all.columns if col in ['question_id', 'agent_name', 'agent_model', 'extracted_answer', 'extracted_confidence', 'is_response_off_topic', 'selected_categories']]}")
        print(f"  Classification columns: {[col for col in df_all.columns if 'classification' in col.lower() or col in ['off_topic_reason', 'error_type']]}")
else:
    print("No data available for analysis")
    df_all = pd.DataFrame()
    df_all_exploded = pd.DataFrame()
    df_all_errors = pd.DataFrame()

2025-05-22 16:59:10,062 - NOTEBOOK - INFO - Multi-Agent-Star: Added missing 'error_type' column with NaN values
2025-05-22 16:59:10,085 - NOTEBOOK - INFO - Multi-Agent-Star: 565 valid records, 0 error records, 1265 exploded category records
2025-05-22 16:59:10,090 - NOTEBOOK - INFO - Single-Agent: Added missing 'error_type' column with NaN values
2025-05-22 16:59:10,103 - NOTEBOOK - INFO - Single-Agent: 900 valid records, 0 error records, 2477 exploded category records
2025-05-22 16:59:10,246 - NOTEBOOK - INFO - Multi-Agent: 25680 valid records, 10 error records, 48932 exploded category records

=== DATA PREPARATION SUMMARY ===

Multi-Agent-Star:
  Valid classifications: 565
  Exploded categories: 1265
  Processing errors: 0
  Unique questions: 20
  Unique models: 6

Single-Agent:
  Valid classifications: 900
  Exploded categories: 2477
  Processing errors: 0
  Unique questions: 90
  Unique models: 1

Multi-Agent:
  Valid classifications: 25680
  Exploded categories: 48932
  Processing errors: 10
  Unique questions: 90
  Unique models: 1
  Error types: 1

Combined:
  Valid classifications: 27145
  Exploded categories: 52674
  Processing errors: 10
  Unique questions: 90
  Unique models: 6
  Error types: 1


## Visualization of Graded Rationale Classifications

The following plots visualize the distribution of classified rationale categories, off-topic responses, answer scores, and other metrics. Visualizations are generated for each discovered dataset type:

1. **Single-agent results** - Individual model responses (from `results/`)
2. **Multi-agent results** - Group conversation responses (from `results_multi/`)
3. **Multi-agent star results** - Star topology conversations (from `results_multi_star/`)
4. **Other discovered datasets** - Any additional `results_*` directories
5. **Combined analysis** - Aggregate view across all types
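
To make the combined view concrete, here is a minimal sketch of how per-type frames could be stacked into one table with a `data_type` label column (the column name matches the sample output earlier in this notebook). `combine_datasets` and the toy frames are illustrative assumptions, not the `create_combined_dataset` implementation.

```python
import pandas as pd

def combine_datasets(frames_by_type):
    """Concatenate non-empty per-type frames, labelling rows with `data_type`."""
    labelled = [
        df.assign(data_type=data_type)
        for data_type, df in frames_by_type.items()
        if not df.empty  # skip dataset types that produced no records
    ]
    if not labelled:
        return pd.DataFrame()
    return pd.concat(labelled, ignore_index=True)

# Toy data standing in for the real per-type analysis frames.
toy = {
    "single_agent": pd.DataFrame({"question_id": [1, 2], "extracted_answer_numeric": [3, 4]}),
    "multi_agent": pd.DataFrame({"question_id": [1], "extracted_answer_numeric": [5]}),
    "ous": pd.DataFrame(),
}
combined = combine_datasets(toy)
print(combined["data_type"].value_counts())
```

In the recorded run, the combined count of valid records (27145) is the sum of the per-type counts (565 + 900 + 25680), which is what a plain concatenation like this produces.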

**Note**: Some visualizations may not be available if classification data is missing. In such cases, basic statistics will be shown instead.

**Updated**: The visualization functions now handle standardized data formats and improved category processing.
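
The "exploded" frames used by the category plots hold one row per (response, category) pair. A small sketch of that step, assuming `selected_categories` holds a list of category names per response (the model names and rows below are made up for illustration):

```python
import pandas as pd

responses = pd.DataFrame({
    "question_id": [1, 2, 3],
    "agent_model": ["model-a", "model-a", "model-b"],
    "selected_categories": [
        ["HARM_AVOIDANCE_NON_MALEFICENCE", "PRAGMATIC_BALANCING_CONTEXTUAL"],
        ["PRAGMATIC_BALANCING_CONTEXTUAL"],
        [],  # no category selected; explode() yields one row holding NaN
    ],
})

# One row per (response, category) pair; other columns are repeated.
exploded = responses.explode("selected_categories", ignore_index=True)

# Drop the NaN rows produced by empty lists before counting.
valid = exploded[exploded["selected_categories"].notna()]

category_counts = valid["selected_categories"].value_counts()
categories_per_response = len(valid) / len(responses)
```

This is why the exploded row counts in the summaries exceed the response counts (for example, 2477 exploded rows for 900 single-agent responses). Whether the notebook's own exploded frames keep or drop the NaN rows is not visible here, so the filtering step is an assumption.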

In [12]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Set theme for seaborn plots
sns.set_theme(style="whitegrid")

# --- Visualization Functions ---
def plot_category_distribution(df_exploded, title_prefix=""):
    """Plot distribution of rationale categories."""
    if df_exploded.empty or 'selected_categories' not in df_exploded.columns:
        print(f"No data available for category distribution: {title_prefix}")
        return None
    
    valid_categories = df_exploded[df_exploded['selected_categories'].notna()]
    if valid_categories.empty:
        print(f"No valid categories found: {title_prefix}")
        return None
    
    category_counts = valid_categories['selected_categories'].value_counts().reset_index()
    category_counts.columns = ['category', 'count']
    
    fig = px.bar(category_counts, x='category', y='count',
                 title=f'{title_prefix} Distribution of Selected Rationale Categories',
                 labels={'category': 'Rationale Category', 'count': 'Frequency'},
                 height=600)
    fig.update_layout(xaxis_tickangle=-45)
    return fig

def plot_off_topic_distribution(df_analysis, title_prefix=""):
    """Plot distribution of off-topic responses."""
    if df_analysis.empty or 'is_response_off_topic' not in df_analysis.columns:
        print(f"No off-topic data available: {title_prefix}")
        return None
    
    off_topic_counts = df_analysis['is_response_off_topic'].value_counts(dropna=False).reset_index()
    off_topic_counts.columns = ['is_off_topic', 'count']
    
    fig = px.pie(off_topic_counts, names='is_off_topic', values='count',
                 title=f'{title_prefix} Distribution of Off-Topic Responses',
                 hole=0.3)
    return fig

def plot_answer_distribution(df_analysis, title_prefix=""):
    """Plot distribution of answer scores."""
    if df_analysis.empty or 'extracted_answer_numeric' not in df_analysis.columns:
        print(f"No answer score data available: {title_prefix}")
        return None
    
    valid_answers = df_analysis[df_analysis['extracted_answer_numeric'].notna()]
    if valid_answers.empty:
        print(f"No valid answer scores found: {title_prefix}")
        return None
    
    fig = px.histogram(valid_answers, x='extracted_answer_numeric',
                       title=f'{title_prefix} Distribution of Answer Scores',
                       labels={'extracted_answer_numeric': 'Answer Score'},
                       nbins=7)
    return fig

def plot_categories_by_model(df_exploded, title_prefix=""):
    """Plot categories by agent model."""
    if (df_exploded.empty or 'agent_model' not in df_exploded.columns or 
        'selected_categories' not in df_exploded.columns):
        print(f"No model/category data available: {title_prefix}")
        return None
    
    valid_data = df_exploded[df_exploded['selected_categories'].notna()]
    if valid_data.empty:
        print(f"No valid model/category data found: {title_prefix}")
        return None
    
    categories_by_model = valid_data.groupby(['agent_model', 'selected_categories']).size().reset_index(name='count')
    
    if categories_by_model.empty:
        print(f"No aggregated model/category data: {title_prefix}")
        return None
    
    fig = px.bar(categories_by_model, x='selected_categories', y='count',
                 color='agent_model', barmode='group',
                 title=f'{title_prefix} Rationale Categories by Agent Model',
                 labels={'selected_categories': 'Rationale Category', 'count': 'Frequency'},
                 height=700)
    fig.update_layout(xaxis_tickangle=-45)
    return fig

# --- Generate Visualizations for Each Dataset ---
for data_type, data_dict in prepared_datasets.items():
    analysis_df = data_dict['analysis']
    exploded_df = data_dict['exploded']
    
    if analysis_df.empty:
        print(f"\n=== {data_type.replace('_', ' ').upper()} VISUALIZATIONS ===")
        print(f"No data available for {data_type} visualizations")
        continue
    
    dataset_name = data_type.replace('_', '-').title()
    print(f"\n=== {dataset_name.upper()} VISUALIZATIONS ===")
    
    # 1. Category Distribution
    fig = plot_category_distribution(exploded_df, f"{dataset_name} -")
    if fig:
        fig.show()
    
    # 2. Off-topic Distribution
    fig = plot_off_topic_distribution(analysis_df, f"{dataset_name} -")
    if fig:
        fig.show()
    
    # 3. Answer Score Distribution
    fig = plot_answer_distribution(analysis_df, f"{dataset_name} -")
    if fig:
        fig.show()
    
    # 4. Categories by Model
    fig = plot_categories_by_model(exploded_df, f"{dataset_name} -")
    if fig:
        fig.show()
    
    # 5. Average Answer by Question ID
    if ('extracted_answer_numeric' in analysis_df.columns and 
        'question_id' in analysis_df.columns and
        analysis_df['extracted_answer_numeric'].notna().any()):
        
        avg_answer_by_qid = analysis_df.groupby('question_id')['extracted_answer_numeric'].mean().reset_index()
        if not avg_answer_by_qid.empty:
            fig = px.bar(avg_answer_by_qid, x='question_id', y='extracted_answer_numeric',
                        title=f'{dataset_name} - Average Answer Score by Question ID',
                        labels={'question_id': 'Question ID', 'extracted_answer_numeric': 'Average Answer Score'})
            fig.update_layout(xaxis_type='category')
            fig.show()


=== MULTI-AGENT-STAR VISUALIZATIONS ===



=== SINGLE-AGENT VISUALIZATIONS ===



=== MULTI-AGENT VISUALIZATIONS ===



=== COMBINED VISUALIZATIONS ===


In [13]:
# --- Comparative Analysis Between All Dataset Types ---
print("\n=== COMPARATIVE ANALYSIS ===")

# Filter out combined dataset for comparison (we'll show it separately)
comparison_datasets = {k: v for k, v in prepared_datasets.items() if k != 'combined'}

if len(comparison_datasets) >= 2:
    print(f"Comparing {len(comparison_datasets)} dataset types")
    
    # 1. Side-by-side Category Comparison
    valid_exploded_datasets = {k: v['exploded'] for k, v in comparison_datasets.items() 
                              if not v['exploded'].empty and 'selected_categories' in v['exploded'].columns}
    
    if len(valid_exploded_datasets) >= 2:
        print("\n1. Category Distribution Comparison")
        
        # Create subplots for category comparison
        n_datasets = len(valid_exploded_datasets)
        cols = min(3, n_datasets)  # Max 3 columns
        rows = (n_datasets + cols - 1) // cols  # Calculate needed rows
        
        fig = make_subplots(
            rows=rows, cols=cols,
            subplot_titles=[k.replace('_', '-').title() for k in valid_exploded_datasets.keys()],
            specs=[[{"type": "bar"} for _ in range(cols)] for _ in range(rows)]
        )
        
        for i, (data_type, exploded_df) in enumerate(valid_exploded_datasets.items()):
            row = (i // cols) + 1
            col = (i % cols) + 1
            
            category_counts = exploded_df['selected_categories'].value_counts().head(10)
            
            fig.add_trace(
                go.Bar(x=category_counts.index, y=category_counts.values, 
                      name=data_type.replace('_', '-').title()),
                row=row, col=col
            )
        
        fig.update_layout(
            title_text="Top 10 Rationale Categories by Dataset Type",
            height=400 * rows,
            showlegend=False
        )
        fig.update_xaxes(tickangle=-45)
        fig.show()
    
    # 2. Answer Score Comparison
    valid_analysis_datasets = {k: v['analysis'] for k, v in comparison_datasets.items() 
                              if not v['analysis'].empty and 'extracted_answer_numeric' in v['analysis'].columns}
    
    if len(valid_analysis_datasets) >= 2:
        print("\n2. Answer Score Distribution Comparison")
        
        # Build one long-format frame of non-null scores, labelled by dataset type
        score_frames = [
            analysis_df.loc[analysis_df['extracted_answer_numeric'].notna(), ['extracted_answer_numeric']]
            .rename(columns={'extracted_answer_numeric': 'score'})
            .assign(dataset_type=data_type.replace('_', '-').title())
            for data_type, analysis_df in valid_analysis_datasets.items()
        ]
        comparison_df = pd.concat(score_frames, ignore_index=True)

        if not comparison_df.empty:
            
            fig = px.box(comparison_df, x='dataset_type', y='score',
                        title='Answer Score Distribution by Dataset Type',
                        labels={'dataset_type': 'Dataset Type', 'score': 'Answer Score'})
            fig.show()
    
    # 3. Off-topic Response Comparison
    print("\n3. Off-topic Response Rate Comparison")
    
    off_topic_comparison_data = []
    for data_type, data_dict in comparison_datasets.items():
        analysis_df = data_dict['analysis']
        if not analysis_df.empty and 'is_response_off_topic' in analysis_df.columns:
            off_topic_rate = (analysis_df['is_response_off_topic'].sum() / len(analysis_df)) * 100
            off_topic_comparison_data.append({
                'dataset_type': data_type.replace('_', '-').title(),
                'off_topic_rate': off_topic_rate,
                'total_responses': len(analysis_df)
            })
    
    if off_topic_comparison_data:
        off_topic_df = pd.DataFrame(off_topic_comparison_data)
        
        fig = px.bar(off_topic_df, x='dataset_type', y='off_topic_rate',
                    title='Off-Topic Response Rates by Dataset Type (%)',
                    labels={'dataset_type': 'Dataset Type', 'off_topic_rate': 'Off-Topic Rate (%)'},
                    text='total_responses')
        fig.update_traces(texttemplate='n=%{text}', textposition="outside")
        fig.show()
    
    # 4. Model Performance Comparison (if applicable)
    print("\n4. Model Performance Comparison")
    
    model_performance_data = []
    for data_type, data_dict in comparison_datasets.items():
        analysis_df = data_dict['analysis']
        if (not analysis_df.empty and 'agent_model' in analysis_df.columns and 
            'extracted_answer_numeric' in analysis_df.columns):
            
            model_perf = analysis_df.groupby('agent_model')['extracted_answer_numeric'].agg(['mean', 'count']).reset_index()
            model_perf['dataset_type'] = data_type.replace('_', '-').title()
            model_performance_data.append(model_perf)
    
    if model_performance_data:
        combined_model_perf = pd.concat(model_performance_data, ignore_index=True)
        
        # Filter to models that appear in multiple datasets
        model_counts = combined_model_perf['agent_model'].value_counts()
        common_models = model_counts[model_counts > 1].index
        
        if len(common_models) > 0:
            filtered_perf = combined_model_perf[combined_model_perf['agent_model'].isin(common_models)]
            
            fig = px.bar(filtered_perf, x='agent_model', y='mean',
                        color='dataset_type', barmode='group',
                        title='Average Answer Score by Model and Dataset Type',
                        labels={'agent_model': 'Agent Model', 'mean': 'Average Answer Score'},
                        text='count')
            fig.update_traces(texttemplate='n=%{text}', textposition="outside")
            fig.update_layout(xaxis_tickangle=-45)
            fig.show()

else:
    print("Not enough datasets for comparative analysis (need at least 2 non-empty datasets)")

print("\n=== ANALYSIS COMPLETE ===")


=== COMPARATIVE ANALYSIS ===
Comparing 3 dataset types

1. Category Distribution Comparison



2. Answer Score Distribution Comparison



3. Off-topic Response Rate Comparison



4. Model Performance Comparison



=== ANALYSIS COMPLETE ===


In [None]:
# --- Summary Statistics Table (Updated) ---
print("=== COMPREHENSIVE SUMMARY STATISTICS ===")

summary_df = generate_summary_stats(prepared_datasets)

if not summary_df.empty:
    print("\nDetailed Summary Statistics:")
    print(summary_df.to_string(index=False))
    
    # Display as a nice table using plotly if we have visualization libraries
    try:
        import plotly.graph_objects as go
        
        fig = go.Figure(data=[go.Table(
            header=dict(values=list(summary_df.columns),
                       fill_color='paleturquoise',
                       align='left'),
            cells=dict(values=[summary_df[col] for col in summary_df.columns],
                      fill_color='lavender',
                      align='left'))
        ])
        fig.update_layout(title="Summary Statistics: Multi-Dataset Analysis with Improved Data Handling")
        fig.show()
    except ImportError:
        print("Plotly not available, summary table shown above")

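# --- Error Analysis (Updated) ---
# Note: 'combined' is included in this loop, so its errors repeat those of the
# individual datasets and show up as a separate 'Combined' series and column.
all_errors = []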
for data_type, data_dict in prepared_datasets.items():
    errors_df = data_dict['errors']
    if not errors_df.empty:
        errors_df = errors_df.copy()
        errors_df['source_dataset'] = data_type.replace('_', '-').title()
        all_errors.append(errors_df)

if all_errors:
    combined_errors = pd.concat(all_errors, ignore_index=True)
    
    print("\n=== ERROR ANALYSIS ===")
    if 'error_type' in combined_errors.columns:
        error_by_type = combined_errors.groupby(['source_dataset', 'error_type']).size().reset_index(name='count')
        
        if not error_by_type.empty:
            try:
                import plotly.express as px
                fig = px.bar(error_by_type, x='error_type', y='count', color='source_dataset',
                            title='Processing Errors by Type and Dataset',
                            labels={'error_type': 'Error Type', 'count': 'Count'},
                            barmode='group')
                fig.update_layout(xaxis_tickangle=-45)
                fig.show()
            except ImportError:
                print("Plotly not available for error visualization")
            
            print("Error summary by dataset:")
            try:
                error_pivot = error_by_type.pivot(index='error_type', columns='source_dataset', values='count').fillna(0)
                print(error_pivot)
            except Exception as e:
                print(f"Could not create error pivot table: {e}")
                print(error_by_type)
else:
    print("\n=== NO PROCESSING ERRORS FOUND ===")

# --- Data Quality Report ---
print("\n=== DATA QUALITY REPORT ===")
for data_type, data_dict in prepared_datasets.items():
    if data_type != 'combined':
        info = data_dict['info']
        analysis_df = data_dict['analysis']
        exploded_df = data_dict['exploded']
        
        # Classification data takes precedence; otherwise report raw data or none
        status = (
            "✓ Classification data" if info['has_classification']
            else "○ Raw data only" if not analysis_df.empty
            else "✗ No data"
        )
        print(f"\n{data_type}:")
        print(f"  Status: {status}")
        print(f"  Files processed: {info['file_count']}")
        print(f"  File type: {info['file_type']}")
        
        if not analysis_df.empty:
            # Check data completeness
            key_columns = ['question_id', 'agent_name', 'extracted_answer']
            missing_data = {}
            for col in key_columns:
                if col in analysis_df.columns:
                    missing_count = analysis_df[col].isna().sum()
                    missing_pct = (missing_count / len(analysis_df)) * 100
                    missing_data[col] = f"{missing_count} ({missing_pct:.1f}%)"
                else:
                    missing_data[col] = "Column missing"
            
            print(f"  Missing data: {missing_data}")
            
            if info['has_classification'] and not exploded_df.empty:
                print(f"  Categories per response: {len(exploded_df) / len(analysis_df):.1f} avg")

print("\nNotebook execution complete!")
print("\nData Sources Summary:")
for data_type, data_dict in prepared_datasets.items():
    if data_type != 'combined':
        info = data_dict['info']
        analysis_df = data_dict['analysis']
        status = (
            "✓ Classification data" if info['has_classification']
            else "○ Raw data only" if not analysis_df.empty
            else "✗ No data"
        )
        print(f"  {data_type}: {status} ({info['file_count']} files)")

=== SUMMARY STATISTICS ===

Summary Statistics Table:
         Dataset  Total Responses  Unique Questions  Unique Models  Off-Topic Rate (%)  Avg Answer Score           Most Common Category  Total Categories Used
Multi-Agent-Star              565                20              6                0.00              4.08 PRAGMATIC_BALANCING_CONTEXTUAL                     18
    Single-Agent              900                90              1                0.44              3.43 HARM_AVOIDANCE_NON_MALEFICENCE                     15
     Multi-Agent            25680                90              1                0.04              4.39 HARM_AVOIDANCE_NON_MALEFICENCE                     22
        Combined            27145                90              6                0.05              4.35 PRAGMATIC_BALANCING_CONTEXTUAL                     23



=== ERROR ANALYSIS ===


Error summary by dataset:
source_dataset                       Combined  Multi-Agent
error_type                                                
ClassificationCallFailed_RetryError        10           10

Notebook execution complete!
