# Evaluation and Analysis of Synthetic Survey Responses

This notebook shows how to evaluate synthetic survey responses generated by LLMs and how to analyze the results. It is part of the **How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective** project.

## Overview
The evaluation pipeline consists of three main steps:

1. **Run Evaluations** (`evaluations()`): 
   - Evaluates all models on the specified dataset
   - Performs train-test splits to find optimal sample sizes (k_hat)
   - Computes miscoverage rates, confidence interval half-widths, and other metrics
   - Saves results to JSON files for later visualization

2. **Generate Plots** (`plot_from_saved_evaluations()`):
   - Creates visualizations from saved evaluation results
   - Generates plots for kappa_hat, synthetic CI half-width, and test miscoverage rates
   - Compares performance across different models and alpha values

3. **Sharpness Analysis** (`sharpness_analysis()`):
   - Analyzes the sharpness of confidence intervals
   - Visualizes miscoverage rates as a function of sample size k
   - Helps identify optimal sample sizes for each model
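
To make the idea of picking a sample size from a miscoverage curve concrete, here is a minimal sketch. The selection rule (take the largest k such that the training miscoverage stays at or below α for every sample size up to k) is an assumption made for illustration, and `largest_valid_k` is a hypothetical helper; the rule actually used by `src/evaluations.py` may differ.

```python
def largest_valid_k(miscoverage_by_k, alpha, k_min=2):
    """Pick the largest k whose miscoverage stays <= alpha (illustrative rule).

    miscoverage_by_k maps a sample size k to the miscoverage rate measured
    on the training questions. This helper is hypothetical and not part of
    the project code.
    """
    best = None
    for k in sorted(miscoverage_by_k):
        if k < k_min:
            continue  # too few samples for a valid interval
        if miscoverage_by_k[k] > alpha:
            break  # stop at the first violation
        best = k
    return best


# Made-up miscoverage curve, for demonstration only
curve = {2: 0.01, 3: 0.03, 4: 0.05, 5: 0.08, 6: 0.04}
print(largest_valid_k(curve, alpha=0.05))
```

Stopping at the first violation keeps the choice conservative: a later k that happens to dip back under α (k=6 in the toy curve) is not selected.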

Runtime note: `src/evaluations.py` now caches real confidence intervals and per-question prefix statistics for the synthetic answers, so repeated evaluations (e.g., during sweeps over many random splits) run significantly faster.
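
The caching idea can be illustrated with cumulative sums. The sketch below assumes that "prefix statistics" means the mean and variance of the first k synthetic answers for every k; `prefix_statistics` is a hypothetical helper, not the function in `src/evaluations.py`.

```python
from itertools import accumulate


def prefix_statistics(samples):
    """Return [(mean, variance), ...] for samples[:k], k = 1..len(samples).

    The variance is the unbiased sample variance; it is None at k = 1,
    where it is undefined. Hypothetical helper for illustration.
    """
    sums = list(accumulate(samples))
    square_sums = list(accumulate(x * x for x in samples))
    stats = []
    for k, (s, sq) in enumerate(zip(sums, square_sums), start=1):
        mean = s / k
        if k < 2:
            variance = None
        else:
            # Clamp at 0 to absorb tiny negative values caused by rounding
            variance = max((sq - k * mean * mean) / (k - 1), 0.0)
        stats.append((mean, variance))
    return stats
```

Computing all prefixes in one pass costs O(k_max) per question, instead of recomputing a mean and variance from scratch for each k, which is why caching them pays off across many random splits.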

## Key Concepts

- **Alpha (α)**: Significance level for synthetic confidence intervals. A smaller alpha means higher confidence (e.g., α=0.05 → 95% confidence).
- **Gamma (γ)**: Coverage probability for real confidence intervals (e.g., γ=0.5 → 50% coverage).
- **k_hat**: Optimal number of synthetic samples needed to achieve desired coverage.
- **Miscoverage Rate**: Proportion of cases where the synthetic CI fails to cover the real population parameter.
- **Coverage Types**:
  - **General**: Confidence set inclusion test
  - **Simple**: Empirical mean inclusion test
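
As a concrete reading of the miscoverage rate, the sketch below counts how often an interval misses its target. It is a minimal illustration under stated assumptions: `miscoverage_rate` is a hypothetical helper, and `targets` stands in for whatever value each interval is supposed to cover (for instance an empirical mean, as in the simple coverage type).

```python
def miscoverage_rate(intervals, targets):
    """Fraction of (lower, upper) intervals that fail to contain their target."""
    if len(intervals) != len(targets):
        raise ValueError("intervals and targets must have the same length")
    if not intervals:
        raise ValueError("at least one interval is required")
    misses = sum(
        not (lower <= target <= upper)
        for (lower, upper), target in zip(intervals, targets)
    )
    return misses / len(intervals)


# Made-up intervals and targets for four questions
intervals = [(0.4, 0.6), (0.1, 0.3), (0.5, 0.9), (0.2, 0.4)]
targets = [0.5, 0.35, 0.7, 0.1]
print(miscoverage_rate(intervals, targets))  # 2 of 4 intervals miss -> 0.5
```

A well-calibrated synthetic CI at level α should give a miscoverage rate of at most about α on held-out questions.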

## Usage Instructions
1. **Configure parameters**: Set dataset name, models, alpha values, and other evaluation parameters in the configuration cell below.

2. **Run evaluations**: Execute `evaluations()` to compute evaluation metrics. This may take some time depending on the number of models, splits, and k_max.

3. **Generate plots**: Execute `plot_from_saved_evaluations()` to create visualizations from the saved results.

4. **Analyze sharpness**: Execute `sharpness_analysis()` for detailed sharpness analysis of specific models.

## Key Parameters

- **`dataset_name`** (str): Dataset to evaluate. Options: `'EEDI'` or `'OpinionQA'`
- **`models`** (list): List of model names to evaluate (must match synthetic answer file names)
- **`alphas`** (list): Significance levels to evaluate (e.g., `[0.05, 0.10, 0.15, 0.20]`)
- **`gamma`** (float): Coverage probability for real CIs (default: 0.5)
- **`k_max`** (int): Maximum number of synthetic samples to evaluate (200 for OpinionQA, 60 for EEDI)
- **`C`** (float): Scaling constant for synthetic CI half-width (default: 2)
- **`train_proportion`** (float): Proportion of questions for training (default: 0.6)
- **`num_splits`** (int): Number of train-test splits for robust statistics (default: 100)
- **`CI_type`** (str): Type of confidence interval. Options: `'clt'`, `'hoeffding'`, `'bernstein'`
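
For intuition about the three `CI_type` options, the sketch below computes textbook two-sided half-widths for the mean of k samples bounded in [0, 1]. These are standard forms (normal approximation, Hoeffding's inequality, and the Maurer-Pontil empirical Bernstein bound), written as an assumption-laden illustration: the exact formulas in `src/evaluations.py`, including how the constant `C` enters, may differ, and `half_width` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def half_width(ci_type, k, alpha, sample_var=None):
    """Two-sided (1 - alpha) half-width for the mean of k samples in [0, 1].

    sample_var is the unbiased sample variance; it is required for
    'clt' and 'bernstein' and ignored for 'hoeffding'.
    """
    if k < 2 or not 0 < alpha < 1:
        raise ValueError("need k >= 2 and 0 < alpha < 1")
    if ci_type == 'hoeffding':
        return math.sqrt(math.log(2 / alpha) / (2 * k))
    if sample_var is None:
        raise ValueError(f"sample_var is required for CI_type={ci_type!r}")
    if ci_type == 'clt':
        z = NormalDist().inv_cdf(1 - alpha / 2)
        return z * math.sqrt(sample_var / k)
    if ci_type == 'bernstein':
        # One-sided Maurer-Pontil bound applied to both tails (union bound)
        log_term = math.log(4 / alpha)
        return (math.sqrt(2 * sample_var * log_term / k)
                + 7 * log_term / (3 * (k - 1)))
    raise ValueError(f"unknown CI_type: {ci_type!r}")


for ci_type in ('clt', 'hoeffding', 'bernstein'):
    print(ci_type, round(half_width(ci_type, k=100, alpha=0.05, sample_var=0.25), 4))
```

The CLT interval is the narrowest but only asymptotically valid, while the Hoeffding and Bernstein intervals hold for any finite k; the Bernstein form tightens when the sample variance is small.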

## Output Files

Results are saved to `data/{dataset_name}/{evaluation_results_folder_name}/`:
- `general/reports_all.json`: Evaluation reports for general coverage type
- `simple/reports_all.json`: Evaluation reports for simple coverage type
- `general/sharpness_analysis_all.json`: Sharpness analysis results for general coverage type
- `simple/sharpness_analysis_all.json`: Sharpness analysis results for simple coverage type
- `general/{metric}.pdf`: Plots for general coverage type
- `simple/{metric}.pdf`: Plots for simple coverage type

For more details, see the function documentation in `src/evaluations.py`.
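
To inspect a saved report outside the plotting functions, something like the following sketch may help. It relies only on the path layout listed above; the JSON schema is not described here, so the code loads the file without assuming any keys. `load_reports` and its `root` argument are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path


def load_reports(dataset_name, evaluation_results_folder_name,
                 coverage_type, root='.'):
    """Load reports_all.json for one coverage type ('general' or 'simple').

    root is the project root directory (hypothetical argument; defaults to
    the current working directory).
    """
    path = (Path(root) / 'data' / dataset_name
            / evaluation_results_folder_name / coverage_type
            / 'reports_all.json')
    with path.open() as f:
        return json.load(f)
```

For example, after the evaluation step has finished, `load_reports('OpinionQA', 'evaluation_results', 'simple', root=ROOT_DIR)` returns the parsed JSON, and checking `type(...)` or the top-level keys is a reasonable first step before relying on any particular structure.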

In [None]:
"""
Setup: Import required modules and configure the Python path.

This cell:
1. Finds the project root directory by locating src/evaluations.py
2. Adds the project root to sys.path so we can import from src/
3. Imports all evaluation functions from src.evaluations
"""
import sys
import os

# Find project root by walking up from current directory until we find src/evaluations.py
# This works regardless of where the notebook is run from
cwd = os.path.abspath(os.getcwd())
ROOT_DIR = cwd

# Walk up directory tree to find the directory containing src/evaluations.py
while not os.path.exists(os.path.join(ROOT_DIR, 'src', 'evaluations.py')):
    parent = os.path.dirname(ROOT_DIR)
    if parent == ROOT_DIR:  # Reached filesystem root
        raise FileNotFoundError(f"Could not find src/evaluations.py. Started from: {cwd}")
    ROOT_DIR = parent

# Add project root to Python path
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)

# Import the three pipeline functions used in this notebook
from src.evaluations import evaluations, plot_from_saved_evaluations, sharpness_analysis

print(f"\u2705 Successfully imported evaluation functions from {ROOT_DIR}/src/evaluations.py")

In [None]:
"""
Configuration: Set evaluation parameters.

Modify these parameters to customize your evaluation:
- Dataset and models to evaluate
- Alpha values (significance levels) to test
- Evaluation settings (k_max, C, train_proportion, etc.)
"""
# Dataset configuration
dataset_name = 'OpinionQA'  # Options: 'EEDI' or 'OpinionQA'

# Models to evaluate (must match synthetic answer file names)
models = [
    'claude-3.5-haiku', 
    'deepseek-v3', 
    'gpt-3.5-turbo', 
    'gpt-4o-mini', 
    'gpt-4o', 
    'gpt-5-mini', 
    'llama-3.3-70B-instruct-turbo', 
    'mistral-7B-instruct-v0.3', 
    'random'  # Random baseline for comparison
]

# Folder names
synthetic_answer_folder_name = 'synthetic_answers'  # Folder containing synthetic answers
evaluation_results_folder_name = 'evaluation_results'  # Folder to save evaluation results

# Significance levels to evaluate (alpha values)
# Each alpha corresponds to a confidence level: 1 - alpha
# Example: alpha=0.05 → 95% confidence level
alphas = [0.05, 0.10, 0.15, 0.20]

# Coverage probability for real confidence intervals
gamma = 0.5  # 50% coverage

# Maximum number of synthetic samples to evaluate
k_max = 200  # Use 200 for OpinionQA, 100 for EEDI

# Scaling constant for synthetic CI half-width
C = 2  # Higher C → wider (more conservative) confidence intervals

# Proportion of questions to use for training
train_proportion = 0.6  # 60% training, 40% testing

# Minimum k value required for valid synthetic CI
k_min = 2  # Minimum number of samples needed (must be at least 2)

# Type of confidence interval to compute
CI_type = 'bernstein'  # Options: 'clt', 'hoeffding', 'bernstein'

# Random seed for reproducibility
seed = 0

# Number of train-test splits for robust statistics
num_splits = 100  # More splits → more reliable but slower

print("\u2705 Configuration loaded successfully")
print(f"   Dataset: {dataset_name}")
print(f"   Models: {len(models)} models")
print(f"   Alpha values: {alphas}")
print(f"   k_max: {k_max}, CI_type: {CI_type}, num_splits: {num_splits}")

In [None]:
"""
Step 1: Run evaluations for all models.

This function:
- Loads real answers from the dataset and synthetic answers from all models
- Performs train-test splits to find optimal sample sizes (k_hat)
- Computes miscoverage rates, confidence interval half-widths, and other metrics
- Saves results to JSON files for later visualization

Execution time: Depends on number of models, splits, and k_max. 
Expect several minutes to hours for large evaluations.

Results are saved to: data/{dataset_name}/{evaluation_results_folder_name}/
"""

evaluations(
    dataset_name=dataset_name,
    models=models,
    synthetic_answer_folder_name=synthetic_answer_folder_name, 
    evaluation_results_folder_name=evaluation_results_folder_name,
    alphas=alphas,
    gamma=gamma,
    k_max=k_max,
    C=C,
    train_proportion=train_proportion,
    k_min=k_min,
    CI_type=CI_type,
    seed=seed,
    num_splits=num_splits
)

In [None]:
"""
Step 2: Generate plots from saved evaluation results.

This function creates visualizations comparing all models:
- kappa_hat: Optimal sample size (k_hat) as a function of alpha
- synth_CI_width: Synthetic confidence interval half-width
- test_miscov_rate: Test miscoverage rate

Plots are generated for both 'general' and 'simple' coverage types.
Plots are saved as PDF files in the evaluation_results folder.

Note: This function reads from saved JSON files, so Step 1 must be completed first.
"""
plot_from_saved_evaluations(
    dataset_name=dataset_name,
    evaluation_results_folder_name=evaluation_results_folder_name,
    num_splits=num_splits,
    alphas=alphas,
    gamma=gamma,
    C=C,
    types=['simple', 'general']
)

In [None]:
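"""
Step 3: Sharpness analysis.

This function:
- Analyzes the sharpness of the synthetic confidence intervals
- Visualizes miscoverage rates as a function of sample size k
- Helps identify optimal sample sizes for each model

The call below runs the analysis for the 'simple' coverage type;
set type='general' to analyze the general coverage type instead.
"""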

sharpness_analysis(
    dataset_name=dataset_name,
    evaluation_results_folder_name=evaluation_results_folder_name,
    type='simple',
    histogram_model='gpt-4o'
) 