# Banking77 Synthesis API Usage

This notebook demonstrates how to use the synthesis API to generate training examples for all 77 Banking77 labels.

## Overview

The synthesis API can generate synthetic training examples by:
1. **Direct Config Approach**: Create a synthesis recipe config and submit it directly
2. **Generate Config Approach**: Use the API to auto-generate a config from a task definition

Both approaches will generate conversations where:
- **User message**: A realistic banking customer service query
- **Assistant message**: The label ID (0-76) corresponding to the intent

## Configuration

**Important**: 
- For localhost connections, use `http://localhost:8000` (HTTP, not HTTPS).
- The notebook will automatically disable SSL verification for localhost connections.
- The API uses `X-API-Key` header for authentication (default: `123` for local development).
- Set the `SYNTHESIS_API_KEY` environment variable or update `API_KEY` in the cells for production.
- **OpenAI API Key**: The synthesis config automatically reads `OPENAI_API_KEY` from the environment if it is set.
  The synthesis API requires this key to call OpenAI models. Set it with:
  ```bash
  export OPENAI_API_KEY='your-key-here'
  ```
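
The connection rules above can be collected into one helper. The following is a minimal sketch, not part of the notebook's `utils` module: the function name `resolve_api_settings` is hypothetical, while the environment variable names and defaults are the ones used in the cells of this notebook.

```python
import os


def resolve_api_settings(env=None):
    """Hypothetical helper: derive base URL, headers, and SSL flag from env vars."""
    env = os.environ if env is None else env
    base_url = env.get("SYNTHESIS_API_BASE_URL", "http://localhost:8000").rstrip("/")
    api_key = env.get("SYNTHESIS_API_KEY", "123")  # local-development default
    project_id = env.get("PROJECT_ID", "1")

    # Plain-HTTP localhost URLs have no certificate to verify
    is_local_http = base_url.startswith(("http://localhost", "http://127.0.0.1"))

    return {
        "base_url": base_url,
        "project_id": project_id,
        "verify_ssl": not is_local_http,
        "headers": {
            "accept": "application/json",
            "Content-Type": "application/json",
            "X-API-Key": api_key,
        },
    }


# An empty mapping falls back to the local-development defaults
settings = resolve_api_settings({})
print(settings["base_url"], settings["verify_ssl"])
```

Passing a plain dict instead of `os.environ` makes it easy to check how a production URL would be handled without changing the real environment.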

In [8]:
# Import required libraries
import json
import os
from utils import (
    LABEL_NAMES_MAP,
    create_banking77_synthesis_config,
    create_banking77_synthesis_config_test_only,
    create_banking77_synthesis_config_from_task_definition
)

# Display basic info
print(f"Total labels in Banking77: {len(LABEL_NAMES_MAP)}")
print("\nFirst 5 labels:")
for label_id, label_name in list(LABEL_NAMES_MAP.items())[:5]:
    print(f"  {label_id}: {label_name}")

# Check if OPENAI_API_KEY is set
openai_key = os.environ.get("OPENAI_API_KEY")
if openai_key:
    print(f"\n✓ OPENAI_API_KEY is set (length: {len(openai_key)})")
else:
    print("\n⚠ Warning: OPENAI_API_KEY not set in environment.")
    print("  The synthesis API needs this to call OpenAI models.")
    print("  Set it with: export OPENAI_API_KEY='your-key-here'")

Total labels in Banking77: 77

First 5 labels:
  0: activate_my_card
  1: age_limit
  2: apple_pay_or_google_pay
  3: atm_support
  4: automatic_top_up

✓ OPENAI_API_KEY is set (length: 164)


## Approach 1: Direct Synthesis Config

Create a synthesis recipe config directly and submit it to the API.

In [9]:
# Create synthesis config for generating training examples
# Adjust num_samples to control the total number of examples (spread across all 77 labels)
num_samples = 10000  # Total examples across all labels
requests_per_minute = 1000  # Rate limit for API requests (optional, default: 500)
inference_max_new_tokens = 1024  # Max tokens for model output (increase if hitting token limit errors)
config = create_banking77_synthesis_config(
    num_samples=num_samples,
    requests_per_minute=requests_per_minute,
    inference_max_new_tokens=inference_max_new_tokens
)

# Display the config structure
print("Synthesis Config Structure:")
print(f"  Type: {config['type']}")
print(f"  Model: {config['model_identifier']['model_name']}")
print(f"  Num Samples: {config['synthesis_config']['synthesis_config']['num_samples']}")
print(f"  Strategy: {config['synthesis_config']['synthesis_config']['strategy']}")
print(f"  Labels: {len(config['synthesis_config']['synthesis_config']['strategy_params']['sampled_attributes'][0]['possible_values'])} labels")

# Optionally save to file for inspection
import os
os.makedirs("data/synth", exist_ok=True)
config_path = "data/synth/synthesis_config.json"
with open(config_path, "w") as f:
    json.dump({"recipe": {"recipe_config": config}}, f, indent=2)
print(f"\n✓ Config saved to {config_path}")

Synthesis Config Structure:
  Type: synthesize
  Model: gpt-5-mini
  Num Samples: 10000
  Strategy: general
  Labels: 77 labels

✓ Config saved to data/synth/synthesis_config.json


### Test-Set-Only Config with Labels

In [5]:


num_samples = 1000  # Total examples across all labels
requests_per_minute = 1000  # Rate limit for API requests (optional, default: 500)
inference_max_new_tokens = 1024  # Max tokens for model output (increase if hitting token limit errors)
config = create_banking77_synthesis_config_test_only(
    num_samples=num_samples,
    requests_per_minute=requests_per_minute,
    inference_max_new_tokens=inference_max_new_tokens
)

# Optionally save to file for inspection
import os
os.makedirs("data/synth", exist_ok=True)
config_path = "data/synth/synthesis_config_test_only.json"
with open(config_path, "w") as f:
    json.dump({"recipe": {"recipe_config": config}}, f, indent=2)
print(f"\n✓ Config saved to {config_path}")


✓ Config saved to data/synth/synthesis_config_test_only.json


In [None]:
# Display config (API keys are read from environment variables)
# Note: api_keys will be None if OPENAI_API_KEY is not set in environment
print("Config structure:")
print(f"  Type: {config.get('type')}")
print(f"  Model: {config.get('model_identifier', {}).get('model_name')}")
if config.get('model_identifier', {}).get('api_keys'):
    print("  ✓ API keys configured (from environment)")
else:
    print("  ⚠ API keys not set (set OPENAI_API_KEY environment variable)")

In [10]:
# Example: How to call the synthesis API
import requests

# Set your API credentials
# For localhost, use HTTP (not HTTPS)
# For production, use HTTPS
API_BASE_URL = os.environ.get("SYNTHESIS_API_BASE_URL", "http://localhost:8000")
API_KEY = os.environ.get("SYNTHESIS_API_KEY", "123")  # Default for local dev, change for production
PROJECT_ID = os.environ.get("PROJECT_ID", "1")  # Your project ID

# Determine if we should verify SSL (disable for localhost HTTP)
verify_ssl = not API_BASE_URL.startswith("http://localhost") and not API_BASE_URL.startswith("http://127.0.0.1")

# Prepare the request
url = f"{API_BASE_URL}/v1/projects/{PROJECT_ID}/synthesis:run"
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "X-API-Key": API_KEY
}

# Option 1: Use a new recipe config (creates a new recipe)
payload = {
    "recipe": {
        "recipe_config": config
    }
}

# Option 2: Use an existing recipe (uncomment to use instead)
# payload = {
#     "recipe": {
#         "recipeId": 1,  # Your existing recipe ID
#         "recipeVersion": 1  # Optional: recipe version (uses latest if not specified)
#     }
# }

# Submit the synthesis job
try:
    response = requests.post(url, json=payload, headers=headers, verify=verify_ssl)
    response.raise_for_status()
    
    operation = response.json()
    print(f"✓ Synthesis job started!")
    print(f"  Operation ID: {operation.get('id')}")
    print(f"  Status: {operation.get('status')}")
    print(f"  Check status at: {API_BASE_URL}/v1/projects/{PROJECT_ID}/operations/{operation.get('id')}")
except requests.exceptions.SSLError as e:
    print(f"SSL Error: {e}")
    print("\nIf connecting to localhost, make sure you're using HTTP (http://) not HTTPS (https://)")
    print("Or set verify_ssl=False if you need to bypass SSL verification")
except requests.exceptions.ConnectionError as e:
    print(f"Connection Error: {e}")
    print(f"\nMake sure the API server is running at {API_BASE_URL}")
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
    # An HTTPError raised by raise_for_status() carries the response that triggered it
    print(f"Response: {e.response.text if e.response is not None else 'No response'}")
except Exception as e:
    print(f"Error: {e}")

✓ Synthesis job started!
  Operation ID: 64
  Status: pending
  Check status at: http://localhost:8000/v1/projects/1/operations/64


In [None]:
# Truncated JSON fragment (the start of the object is missing):
#     "test_dataset_id": 21,
#     "train_dataset_id": 19,
#     "validation_dataset_id": 20
#   }

## Approach 2: Generate Config from Task Definition

Use the API's config generation feature to automatically create a synthesis config.

In [4]:
# Create a task definition request
# This will be sent to the generate-config endpoint
num_samples = 100
generate_config_request = create_banking77_synthesis_config_from_task_definition(
    num_samples=num_samples
)

# Display the task definition
print("Task Definition:")
print("=" * 80)
print(generate_config_request["task_definition"])
print("=" * 80)

# Save for inspection
import os
os.makedirs("data/synth", exist_ok=True)
request_path = "data/synth/generate_config_request.json"
with open(request_path, "w") as f:
    json.dump(generate_config_request, f, indent=2)
print(f"\n✓ Request saved to {request_path}")

Task Definition:
Generate training examples for a banking intent classification task.

The task is to classify customer service queries into one of 77 banking intent categories.

Labels:
0: activate_my_card, 1: age_limit, 2: apple_pay_or_google_pay, 3: atm_support, 4: automatic_top_up, 5: balance_not_updated_after_bank_transfer, 6: balance_not_updated_after_cheque_or_cash_deposit, 7: beneficiary_not_allowed, 8: cancel_transfer, 9: card_about_to_expire, 10: card_acceptance, 11: card_arrival, 12: card_delivery_estimate, 13: card_linking, 14: card_not_working, 15: card_payment_fee_charged, 16: card_payment_not_recognised, 17: card_payment_wrong_exchange_rate, 18: card_swallowed, 19: cash_withdrawal_charge, 20: cash_withdrawal_not_recognised, 21: change_pin, 22: compromised_card, 23: contactless_not_working, 24: country_support, 25: declined_card_payment, 26: declined_cash_withdrawal, 27: declined_transfer, 28: direct_debit_payment_not_recognised, 29: disposable_card_limits, 30: edit_perso

In [5]:
# Example: How to call the generate-config endpoint
# This is a two-step process:
# 1. Generate the config
# 2. Use the generated config to run synthesis

import requests

API_BASE_URL = os.environ.get("SYNTHESIS_API_BASE_URL", "http://localhost:8000")
API_KEY = os.environ.get("SYNTHESIS_API_KEY", "123")  # Default for local dev, change for production
PROJECT_ID = os.environ.get("PROJECT_ID", "1")

# Determine if we should verify SSL (disable for localhost HTTP)
verify_ssl = not API_BASE_URL.startswith("http://localhost") and not API_BASE_URL.startswith("http://127.0.0.1")

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "X-API-Key": API_KEY
}

# Step 1: Generate the config
generate_url = f"{API_BASE_URL}/v1/projects/{PROJECT_ID}/synthesis:generate-config"
try:
    generate_response = requests.post(
        generate_url,
        json=generate_config_request,
        headers=headers,
        verify=verify_ssl
    )
    generate_response.raise_for_status()
    
    generate_operation = generate_response.json()
    print(f"✓ Config generation started!")
    print(f"  Operation ID: {generate_operation.get('id')}")
    
    # Step 2: Wait for config generation to complete, then retrieve the config
    # (You'll need to poll the operation status)
    # Once complete, use the generated config in the synthesis:run endpoint
except requests.exceptions.SSLError as e:
    print(f"SSL Error: {e}")
    print("\nIf connecting to localhost, make sure you're using HTTP (http://) not HTTPS (https://)")
except requests.exceptions.ConnectionError as e:
    print(f"Connection Error: {e}")
    print(f"\nMake sure the API server is running at {API_BASE_URL}")
except Exception as e:
    print(f"Error: {e}")

✓ Config generation started!
  Operation ID: 40
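
Step 2 is left as comments in the cell above. One way it might look, as a sketch only: both helper names are hypothetical, and the `generated_config` field is an assumption, since the shape of a finished generate-config operation is not shown in this notebook. Only the `{"recipe": {"recipe_config": ...}}` payload shape comes from Approach 1.

```python
def extract_generated_config(operation, result_key="generated_config"):
    """Pull the generated config out of a finished operation (field name is ASSUMED)."""
    config = operation.get(result_key)
    if not isinstance(config, dict):
        raise KeyError(f"No config found under {result_key!r}; inspect the operation response")
    return config


def build_run_payload(recipe_config):
    """Wrap a recipe config in the payload shape used for synthesis:run in Approach 1."""
    return {"recipe": {"recipe_config": recipe_config}}


# Stand-in for a finished generate-config operation (illustrative, not real API output)
finished_operation = {"id": 40, "generated_config": {"type": "synthesize"}}
payload = build_run_payload(extract_generated_config(finished_operation))
print(payload)
```

Before relying on this, print the completed operation once and adjust `result_key` to whatever field actually holds the config.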


## Inspecting the Config

Let's inspect the structure of `config`, the recipe config built in Approach 1, to see what it contains.

In [None]:
# Inspect the sampled attributes (labels)
sampled_attr = config['synthesis_config']['synthesis_config']['strategy_params']['sampled_attributes'][0]
print("Sampled Attribute (Labels):")
print(f"  ID: {sampled_attr['id']}")
print(f"  Name: {sampled_attr['name']}")
print(f"  Description: {sampled_attr['description']}")
print(f"  Number of possible values: {len(sampled_attr['possible_values'])}")
print(f"\n  First 5 label values:")
for val in sampled_attr['possible_values'][:5]:
    print(f"    - {val['id']}: {val['name']} ({val['description']})")

# Inspect the generated attribute (user queries)
generated_attr = config['synthesis_config']['synthesis_config']['strategy_params']['generated_attributes'][0]
print(f"\nGenerated Attribute (User Queries):")
print(f"  ID: {generated_attr['id']}")
print(f"  Instruction messages: {len(generated_attr['instruction_messages'])}")
print(f"\n  System message:")
print(f"    {generated_attr['instruction_messages'][0]['content'][:100]}...")

# Inspect the transformed attribute (conversation format)
transformed_attr = config['synthesis_config']['synthesis_config']['strategy_params']['transformed_attributes'][0]
print(f"\nTransformed Attribute (Conversation):")
print(f"  ID: {transformed_attr['id']}")
print(f"  Transformation type: {transformed_attr['transformation_strategy']['type']}")
print(f"  Messages in conversation: {len(transformed_attr['transformation_strategy']['chat_transform']['messages'])}")

## Expected Output Format

The synthesis API will generate examples in conversation format. Here's what each example will look like:

In [None]:
# Example of what a generated training example will look like
example_output = {
    "messages": [
        {
            "role": "user",
            "content": "Hi, I need to activate my new credit card. How do I do that?"
        },
        {
            "role": "assistant",
            "content": "0"  # Label ID for "activate_my_card"
        }
    ],
    "metadata": {
        "label_name": "activate_my_card",
        "label_id": "0"
    }
}

print("Example Generated Training Example:")
print(json.dumps(example_output, indent=2))
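
A quick structural check can catch malformed examples before training. The sketch below assumes each example matches the shape shown above (a `messages` list with one user turn and one assistant turn holding a label ID from 0 to 76); `validate_example` is a hypothetical helper, not part of `utils`.

```python
NUM_LABELS = 77  # Banking77 label IDs run from 0 to 76


def validate_example(example, num_labels=NUM_LABELS):
    """Return a list of problems found in one generated example (empty if it looks valid)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return ["expected exactly two messages (user, assistant)"]

    problems = []
    user, assistant = messages
    if user.get("role") != "user" or not str(user.get("content", "")).strip():
        problems.append("first message must be a non-empty user query")
    if assistant.get("role") != "assistant":
        problems.append("second message must come from the assistant")

    # The assistant content should be a bare integer label ID
    label = str(assistant.get("content", "")).strip()
    if not (label.isascii() and label.isdigit()) or int(label) >= num_labels:
        problems.append(f"assistant content {label!r} is not a label ID in 0-{num_labels - 1}")
    return problems


sample = {
    "messages": [
        {"role": "user", "content": "Hi, I need to activate my new credit card. How do I do that?"},
        {"role": "assistant", "content": "0"},
    ]
}
print(validate_example(sample))  # → []
```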

## Checking Operation Status

After submitting a synthesis job, you'll need to check the operation status to see when it completes.

In [None]:
# Example: How to check operation status and retrieve results

import requests
import time

API_BASE_URL = os.environ.get("SYNTHESIS_API_BASE_URL", "http://localhost:8000")
API_KEY = os.environ.get("SYNTHESIS_API_KEY", "123")  # Default for local dev, change for production
PROJECT_ID = os.environ.get("PROJECT_ID", "1")
OPERATION_ID = "your_operation_id"  # From the synthesis:run response

# Determine if we should verify SSL (disable for localhost HTTP)
verify_ssl = not API_BASE_URL.startswith("http://localhost") and not API_BASE_URL.startswith("http://127.0.0.1")

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "X-API-Key": API_KEY
}

# Poll for operation status
status_url = f"{API_BASE_URL}/v1/projects/{PROJECT_ID}/operations/{OPERATION_ID}"
max_wait_time = 3600  # 1 hour max
poll_interval = 10  # Check every 10 seconds
elapsed = 0

while elapsed < max_wait_time:
    try:
        response = requests.get(status_url, headers=headers, verify=verify_ssl)
        response.raise_for_status()
        operation = response.json()
    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error checking status: {e}")
        break

    # Compare case-insensitively: the synthesis:run response reports a lowercase status ("pending")
    status = str(operation.get('status') or '').upper()
    print(f"Status: {status}")

    if status == 'COMPLETED':
        print("✓ Synthesis completed!")
        # The dataset will be available at the dataset_id specified in the config
        # or a new dataset will be created
        dataset_id = (operation.get('status_metadata') or {}).get('dataset_id')
        print(f"Dataset ID: {dataset_id}")
        break
    elif status == 'FAILED':
        print("✗ Synthesis failed!")
        error = operation.get('error') or {}
        print(f"Error: {error.get('message', 'Unknown error')}")
        break

    print(f"Waiting... ({elapsed}s elapsed)")
    time.sleep(poll_interval)
    elapsed += poll_interval
else:
    # The loop ran out of time without reaching a terminal status
    print(f"Timed out after {max_wait_time}s; the operation may still be running.")
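
The same polling logic can be packaged as a reusable function. This is a sketch under assumptions: `wait_for_operation` and its `fetch` callback are hypothetical names, and terminal states are matched case-insensitively because the `synthesis:run` response earlier in this notebook reported a lowercase `pending`. Injecting `fetch` and `sleep` keeps the function testable without a live server.

```python
import time


def wait_for_operation(fetch, poll_interval=10, max_wait_time=3600, sleep=time.sleep):
    """Call fetch() until the operation reaches a terminal status or time runs out.

    fetch: zero-argument callable returning the operation as a dict
           (for example, a wrapper around requests.get(status_url, ...).json()).
    """
    elapsed = 0
    while elapsed < max_wait_time:
        operation = fetch()
        status = str(operation.get("status") or "").upper()
        if status in ("COMPLETED", "FAILED"):
            return operation
        sleep(poll_interval)
        elapsed += poll_interval
    raise TimeoutError(f"Operation still running after {max_wait_time}s")


# Usage with a fake fetcher that reaches a terminal status on the third poll
responses = iter([{"status": "pending"}, {"status": "pending"}, {"status": "completed", "id": 64}])
final = wait_for_operation(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"])
```

Raising `TimeoutError` instead of returning silently lets the caller distinguish a job that failed from one that is simply still running.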

## Using Generated Data

Once the synthesis is complete, you can use the generated dataset for training. The data will be in the same format as your existing Banking77 training data.

In [None]:
# Example: Loading and using the generated dataset
# (This assumes you've downloaded or have access to the dataset)

# from utils import load_conversations

# # Load the generated dataset
# generated_data_path = "data/banking77_synthetic_train.jsonl"
# generated_conversations = load_conversations(generated_data_path)

# print(f"Loaded {len(generated_conversations)} synthetic examples")

# # Display a few examples
# print("\nSample generated examples:")
# for i, conv in enumerate(generated_conversations[:3]):
#     print(f"\nExample {i+1}:")
#     for msg in conv:
#         print(f"  {msg['role']}: {msg['content']}")
    
# You can now:
# 1. Combine with existing training data
# 2. Use for fine-tuning
# 3. Evaluate the quality
# 4. Filter or post-process as needed
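
Before mixing synthetic data into training, it helps to look at the label distribution. A minimal sketch, assuming the dataset has been exported as JSON Lines with one `messages` conversation per line in the format shown earlier; the export format is an assumption, and `label_counts_from_jsonl` is a hypothetical helper.

```python
import json
from collections import Counter


def label_counts_from_jsonl(lines):
    """Count assistant label IDs across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        messages = json.loads(line)["messages"]
        # The assistant turn carries the label ID as a string
        label = next(m["content"] for m in messages if m["role"] == "assistant")
        counts[label.strip()] += 1
    return counts


# Illustrative lines only; real data would come from the exported file
sample_lines = [
    '{"messages": [{"role": "user", "content": "How do I activate my card?"}, {"role": "assistant", "content": "0"}]}',
    '{"messages": [{"role": "user", "content": "Is there an age limit?"}, {"role": "assistant", "content": "1"}]}',
    '{"messages": [{"role": "user", "content": "My new card needs activating."}, {"role": "assistant", "content": "0"}]}',
]
print(label_counts_from_jsonl(sample_lines).most_common())  # → [('0', 2), ('1', 1)]
```

Because the function accepts any iterable of lines, an open file handle can be passed directly in place of `sample_lines`.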

## Tips and Best Practices

1. **Start Small**: Test with a small `num_samples` (e.g., 10-50) first to verify the config works
2. **Monitor Costs**: Synthesis uses LLM API calls, so monitor your usage
3. **Label Distribution**: The current config uses uniform sampling. You can adjust `sample_rate` in the label values to weight certain labels
4. **Customize Prompts**: Modify the `instruction_messages` in the generated attribute to change the style of queries
5. **Quality Check**: Always review a sample of generated data before using for training
6. **Combine with Real Data**: Consider mixing synthetic data with your existing real training data
7. **Token Limit Errors**: If you see "max_tokens or model output limit was reached" errors, increase `inference_max_new_tokens` (default: 1024, try 2048 or 4096)
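
To put tips 1 and 3 in numbers, it can help to estimate how many examples each label receives. The sketch below is plain arithmetic, assuming per-label rates are probabilities that sum to 1 and that uniform sampling gives each of the 77 labels a rate of 1/77; `expected_count` is a hypothetical helper.

```python
NUM_LABELS = 77


def expected_count(num_samples, sample_rate=None, num_labels=NUM_LABELS):
    """Expected number of examples for one label; defaults to the uniform rate 1/num_labels."""
    rate = 1.0 / num_labels if sample_rate is None else sample_rate
    return num_samples * rate


for n in (50, 1000, 10000):
    print(f"num_samples={n}: about {expected_count(n):.1f} examples per label")
```

With `num_samples=50`, at least 27 of the 77 labels receive no examples at all, so a small run verifies the pipeline rather than label coverage.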

In [None]:
# Example: Customizing the config for weighted label sampling
# If you want more examples for certain labels, adjust sample_rate

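def create_weighted_synthesis_config(num_samples=100, label_weights=None):
    """
    Create a synthesis config with custom label weights.

    Args:
        num_samples: Total number of examples
        label_weights: Dict mapping label_id to the fraction of samples for that
                      label (e.g., {0: 0.1, 1: 0.05, ...}). Labels not listed
                      share the remaining fraction equally.
                      If None, uses uniform distribution
    """
    config = create_banking77_synthesis_config(
        num_samples=num_samples,
        inference_max_new_tokens=2048  # Increase if needed
    )

    if label_weights:
        total_weight = sum(label_weights.values())
        if total_weight <= 0:
            raise ValueError("label_weights must sum to a positive value")

        values = config['synthesis_config']['synthesis_config']['strategy_params']['sampled_attributes'][0]['possible_values']
        unlisted = [val for val in values if int(val['id']) not in label_weights]

        # Scale the listed weights down only if they exceed 1.0 or cover every label
        scale = 1.0 / total_weight if (total_weight > 1.0 or not unlisted) else 1.0
        # Whatever is left is spread evenly over the unlisted labels
        # (assumes sample_rate values are probabilities that sum to 1.0)
        remainder = max(0.0, 1.0 - total_weight * scale)

        for val in values:
            label_id = int(val['id'])
            if label_id in label_weights:
                val['sample_rate'] = label_weights[label_id] * scale
            else:
                val['sample_rate'] = remainder / len(unlisted)

    return config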

# Example: Give more weight to certain labels
# important_labels = {0: 0.2, 1: 0.15, 2: 0.1}  # 45% of samples
# weighted_config = create_weighted_synthesis_config(
#     num_samples=1000,
#     label_weights=important_labels
# )