# Synthetic Survey Response Generation
This notebook demonstrates how to use `generate_responses` to produce synthetic survey responses with Large Language Models (LLMs).

## Overview
The `generate_responses` function provides a complete pipeline for generating synthetic survey responses:

- **LLM Integration**: Supports multiple LLM APIs (OpenAI, Anthropic, TogetherAI)
- **Concurrent Processing**: Processes questions concurrently for efficiency
- **Resume Capability**: Saves raw responses incrementally, allowing you to resume interrupted runs
- **Post-Processing**: Automatically extracts structured answers from raw responses
- **Dataset Support**: Works with two datasets:
  - **EEDI**: Educational assessment dataset
  - **OpinionQA**: Opinion polling dataset

## Available Models
The following models are supported:

**OpenAI** (`api_platform='openai'`):
- `gpt-3.5-turbo`
- `gpt-4o-mini`
- `gpt-4o`
- `gpt-5-mini`

**Anthropic** (`api_platform='anthropic'`):
- `claude-3.5-haiku`

**TogetherAI** (`api_platform='togetherai'`):
- `deepseek-v3`
- `llama-3.3-70B-instruct-turbo`
- `mistral-7B-instruct-v0.3`

## Usage Instructions
1. **Set your API key**: 
   - Set the `API_KEY` environment variable before running the notebook, or
   - Modify the `api_key` variable directly in the code cell below

2. **Configure parameters**: 
   - Choose your API platform (`'openai'`, `'anthropic'`, or `'togetherai'`)
   - Select an LLM model from the list above
   - Choose a dataset (`'EEDI'` or `'OpinionQA'`)
   - Set the number of synthetic answers per question (max 200 for both datasets)
   - Specify the output folder name

3. **Run the cell**: Execute the `await generate_responses(...)` call

4. **Check results**: 
   - Raw responses: `data/{dataset_name}/{folder_name}/raw/{llm}.json`
   - Cleaned results: `data/{dataset_name}/{folder_name}/clean/{llm}.json`
   - Random baseline: `data/{dataset_name}/{folder_name}/clean/random.json`
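
The output locations listed in step 4 can be assembled and read back with a small helper. This is a minimal sketch, not part of `src/simulations.py`: the helper names `result_paths` and `load_json_if_present` and the `base_dir` argument are hypothetical, and the paths are assumed to be relative to the project root.

```python
import json
from pathlib import Path


def result_paths(dataset_name, folder_name, llm, base_dir="data"):
    """Build the three output paths described above (hypothetical helper)."""
    root = Path(base_dir) / dataset_name / folder_name
    return {
        "raw": root / "raw" / f"{llm}.json",
        "clean": root / "clean" / f"{llm}.json",
        "random": root / "clean" / "random.json",
    }


def load_json_if_present(path):
    """Return the parsed JSON file, or None if it has not been written yet."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


paths = result_paths("OpinionQA", "synthetic_answers_testing", "gpt-4o-mini")
clean = load_json_if_present(paths["clean"])
if clean is None:
    print(f"No cleaned results yet at {paths['clean']}")
else:
    print(f"Loaded cleaned answers for {len(clean)} questions")
```

Returning `None` for a missing file keeps the snippet usable before the first run has finished.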

## Key Parameters

- **`api_platform`** (str): API platform to use. Options: `'openai'`, `'anthropic'`, `'togetherai'`
- **`api_key`** (str): Your API key for authentication
- **`llm`** (str): Model identifier (see available models above)
- **`dataset_name`** (str): Dataset name. Options: `'EEDI'` or `'OpinionQA'`
- **`first_synthetic_profile_id`** (int): Starting ID for synthetic profiles (0-200)
- **`num_of_synthetic_answers`** (int): Number of responses per question (max 200)
  - Note: `first_synthetic_profile_id + num_of_synthetic_answers` should not exceed 200
- **`folder_name`** (str): Output folder name (e.g., `'synthetic_answers'` or `'synthetic_answers_testing'`)
- **`max_concurrent_requests`** (int, optional): Number of concurrent API calls (default: 10)
  - Higher values increase throughput but may hit rate limits
  - For `claude-3.5-haiku`, we set this to 1 because of rate limits
- **`max_retries`** (int, optional): Number of retry attempts for failed requests (default: 3)
  - Uses exponential backoff (1s, 2s, 4s, ...)
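
To illustrate how `max_concurrent_requests` and `max_retries` typically interact, here is a minimal sketch of a semaphore-bounded request loop with exponential backoff. It is not the code in `src/simulations.py`. The names `backoff_delays`, `call_with_retries`, and `run_concurrently` are hypothetical, and the sketch assumes that `max_retries` counts retries after the first attempt and that a request which never succeeds is recorded as an `"ERROR:"` string.

```python
import asyncio


def backoff_delays(max_retries, base_delay=1.0):
    """Delays before each retry: base_delay * 2**attempt (1s, 2s, 4s, ...)."""
    return [base_delay * 2 ** attempt for attempt in range(max_retries)]


async def call_with_retries(make_request, max_retries=3, base_delay=1.0,
                            sleep=asyncio.sleep):
    """Await make_request(), retrying failures with exponential backoff.

    Assumption: one initial attempt plus up to max_retries retries.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await make_request()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                await sleep(base_delay * 2 ** attempt)
    # Assumed convention: keep going and record the failure as a string.
    return f"ERROR: {last_error}"


async def run_concurrently(request_factories, max_concurrent_requests=10,
                           **retry_kwargs):
    """Run all requests, with at most max_concurrent_requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def guarded(factory):
        async with semaphore:
            return await call_with_retries(factory, **retry_kwargs)

    # gather() returns results in the same order as request_factories.
    return await asyncio.gather(*(guarded(f) for f in request_factories))
```

Each element of `request_factories` is a zero-argument callable returning an awaitable, so a failed request can be re-created on every retry. Setting `max_concurrent_requests=1` serializes the calls, which is the setting used for tightly rate-limited models.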

## Important Notes

- **Resume Capability**: If raw results already exist, the function will only process questions that haven't been completed yet. This allows you to safely interrupt and resume long-running jobs.

- **Processing Time**: Large runs (e.g., 100 answers × many questions) can take several hours. Execution time depends on:
  - Number of questions in the dataset
  - Number of synthetic answers per question
  - Model and API platform
  - API rate limits

- **Error Handling**: 
  - Failed API calls are retried with exponential backoff
  - Error responses are saved as strings starting with `"ERROR:"` for debugging
  - The function continues processing even if some requests fail

- **Random Baseline**: A random baseline is automatically generated with `2 × num_of_synthetic_answers` responses per question:
  - **EEDI**: Random binary correctness scores from {0, 1}
  - **OpinionQA**: Random opinion scores from {-1, -1/3, 0, 1/3, 1}
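
The random baseline described above can be sketched as follows. This is an illustration rather than the implementation in `src/simulations.py`: the function name `random_baseline`, the `seed` argument, and the choice of uniform sampling are assumptions, since the notes above only state the value sets and the `2 × num_of_synthetic_answers` count.

```python
import random

# Value sets taken from the notes above.
BASELINE_VALUES = {
    "EEDI": [0, 1],
    "OpinionQA": [-1, -1 / 3, 0, 1 / 3, 1],
}


def random_baseline(question_ids, dataset_name, num_of_synthetic_answers,
                    seed=None):
    """Draw 2 * num_of_synthetic_answers random answers per question.

    Assumption: values are sampled uniformly and independently.
    """
    rng = random.Random(seed)
    values = BASELINE_VALUES[dataset_name]
    n = 2 * num_of_synthetic_answers
    return {qid: [rng.choice(values) for _ in range(n)]
            for qid in question_ids}


# Hypothetical question IDs, for illustration only.
baseline = random_baseline(["q1", "q2"], "OpinionQA",
                           num_of_synthetic_answers=100, seed=0)
print({qid: len(answers) for qid, answers in baseline.items()})
```

The returned dictionary has the same `question_id` to answer-list shape as the cleaned results, so both can be passed through the same downstream analysis.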

## Output Format

The cleaned results are saved as JSON files with the following structure:
```json
{
  "question_id": [answer1, answer2, answer3, ...],
  ...
}
```

Where:
- For **EEDI**: Each answer is a binary value (0 = incorrect, 1 = correct)
- For **OpinionQA**: Each answer is a numeric score in {-1, -1/3, 0, 1/3, 1}
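
As a quick sanity check on this structure, the per-question answer lists can be reduced to summary statistics. The sketch below uses a small inline dictionary in place of a real results file; the function name `summarize` and the sample question IDs are hypothetical. For EEDI the mean is the fraction of correct answers, and for OpinionQA it is the average opinion score.

```python
from statistics import mean


def summarize(clean_results):
    """Return the answer count and mean per question, skipping empty lists."""
    return {
        qid: {"n": len(answers), "mean": mean(answers)}
        for qid, answers in clean_results.items()
        if answers
    }


# Inline stand-in for the contents of a cleaned EEDI results file.
example = {"q1": [1, 0, 1, 1], "q2": [0, 0, 1, 0]}
for qid, stats in summarize(example).items():
    print(qid, stats["n"], stats["mean"])
```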

For more details, see the function documentation in `src/simulations.py`.

In [None]:
"""
Setup: Import required modules and configure the Python path.

This cell:
1. Finds the project root directory by locating src/simulations.py
2. Adds the project root to sys.path so we can import from src/
3. Imports the generate_responses function
"""
import sys
import os

# Find project root by walking up from current directory until we find src/simulations.py
# This works regardless of where the notebook is run from
cwd = os.path.abspath(os.getcwd())
ROOT_DIR = cwd

# Walk up directory tree to find the directory containing src/simulations.py
while not os.path.exists(os.path.join(ROOT_DIR, 'src', 'simulations.py')):
    parent = os.path.dirname(ROOT_DIR)
    if parent == ROOT_DIR:  # Reached filesystem root
        raise FileNotFoundError(f"Could not find src/simulations.py. Started from: {cwd}")
    ROOT_DIR = parent

# Add project root to Python path
if ROOT_DIR not in sys.path:
    sys.path.insert(0, ROOT_DIR)

# Import from src.simulations
from src.simulations import generate_responses

print(f"\u2705 Successfully imported generate_responses from {ROOT_DIR}/src/simulations.py")

In [None]:
"""
Configuration: Set your API key and configure generation parameters.

IMPORTANT: Set your API key before running this cell!
- Option 1 (Recommended): Set the API_KEY environment variable before starting Jupyter
- Option 2: Uncomment and set api_key directly below (not recommended for security)
"""
# Get API key from environment variable
api_key = os.environ.get('API_KEY')

# If API_KEY is not set, you can set it directly here (not recommended for production)
# api_key = 'your-api-key-here'

if not api_key:  # Catches both a missing and an empty API_KEY value
    raise ValueError(
        "API key not found! Please set the API_KEY environment variable or "
        "uncomment and set api_key directly in this cell."
    )

print("\u2705 API key loaded successfully")

# Generate synthetic survey responses.
# A progress bar shows progress; you can safely interrupt the run and resume later.
# Note: top-level `await` works in Jupyter/IPython; in a plain script,
# wrap the call in asyncio.run(...).
await generate_responses(
    api_platform='openai',                   # Options: 'openai', 'anthropic', 'togetherai'
    api_key=api_key,                         # Your API key (set above)
    llm='gpt-4o-mini',                       # See available models in the markdown above
    dataset_name='OpinionQA',                # Options: 'EEDI' or 'OpinionQA'
    first_synthetic_profile_id=0,            # Starting profile ID (0-200)
    num_of_synthetic_answers=100,            # Number of answers per question (max 200)
    folder_name='synthetic_answers_testing', # Output folder name
    max_concurrent_requests=10,              # Concurrent API calls (default: 10)
    max_retries=3                            # Retry attempts for failed requests (default: 3)
)