# Reasoning Models Token Cost Demo

This notebook demonstrates OpenAI's reasoning models at different reasoning effort levels and compares token usage (input, reasoning, and output tokens) for each level.

**Test Prompt:** A spatial reasoning puzzle about a baseball in a box with a hole being shipped from Houston to New York City.

## 1. Install and Import Required Libraries

In [None]:
# Install OpenAI library (uncomment if needed)
# !pip install openai pandas matplotlib

import os
from openai import OpenAI
import pandas as pd
import matplotlib.pyplot as plt

print("Libraries imported successfully!")

## 2. Set Up OpenAI API Client

In [None]:
# Set up the OpenAI API key
# Option 1: Set environment variable OPENAI_API_KEY
# Option 2: Uncomment and add your key directly (not recommended for production)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

client = OpenAI()
print("OpenAI client configured successfully!")

## 3. Define the Test Prompt

In [None]:
test_prompt = """I live in Houston, Texas and I have a box with a big hole in it and I need to ship a baseball. I put the ball into the box. Tape it up. Put a label on it. Then send it to my friend in New York City, New York. He picks up the box on Union Street. Takes a cab to his house on Wilson Blvd and goes into his kitchen. He then opens the box and pours out the contents. Where is the baseball?"""

print("Test Prompt:")
print(test_prompt)

## 4. Test with Different Reasoning Levels

We'll test GPT-5.2 with different reasoning effort levels using the Responses API:
- **Minimal** reasoning effort (fastest, cheapest)
- **Low** reasoning effort
- **Medium** reasoning effort (default)
- **High** reasoning effort (most thorough, highest token usage)

In [None]:
# Helper function to query GPT-5.2 and extract token usage
def query_model(model_name, reasoning_effort=None):
    """Query a model using the Responses API and return response with token usage."""
    # Fall back to "minimal" effort when no level is given
    effort = reasoning_effort or "minimal"
    try:
        response = client.responses.create(
            model=model_name,
            input=[{"role": "user", "content": test_prompt}],
            reasoning={"effort": effort}
        )
        
        # Extract token usage from the response
        usage = response.usage
        
        # Get reasoning tokens from output_tokens_details
        # (these are a subset of usage.output_tokens, not an extra count)
        reasoning_tokens = 0
        if hasattr(usage, 'output_tokens_details') and usage.output_tokens_details:
            reasoning_tokens = getattr(usage.output_tokens_details, 'reasoning_tokens', 0) or 0
        
        result = {
            "model": model_name,
            "reasoning_effort": reasoning_effort if reasoning_effort else "minimal",
            "input_tokens": usage.input_tokens,
            "reasoning_tokens": reasoning_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.total_tokens,
            "answer": response.output_text
        }
        return result
    except Exception as e:
        return {
            "model": model_name,
            "reasoning_effort": reasoning_effort if reasoning_effort else "minimal",
            "error": str(e),
            "input_tokens": 0,
            "reasoning_tokens": 0,
            "output_tokens": 0,
            "total_tokens": 0,
            "answer": f"Error: {str(e)}"
        }

print("Helper function defined successfully!")

### 4.1 GPT-5.2 with Minimal Reasoning Effort

In [None]:
print("Testing GPT-5.2 with minimal reasoning effort...")
result_gpt52_minimal = query_model("gpt-5.2")  # defaults to minimal
print(f"\nModel: {result_gpt52_minimal['model']}")
print(f"Reasoning Effort: {result_gpt52_minimal['reasoning_effort']}")
print(f"Input Tokens: {result_gpt52_minimal['input_tokens']}")
print(f"Reasoning Tokens: {result_gpt52_minimal['reasoning_tokens']}")
print(f"Output Tokens: {result_gpt52_minimal['output_tokens']}")
print(f"Total Tokens: {result_gpt52_minimal['total_tokens']}")
print(f"\nAnswer: {result_gpt52_minimal['answer'][:500]}...")

### 4.2 GPT-5.2 with Low Reasoning Effort

In [None]:
print("Testing GPT-5.2 with low reasoning effort...")
result_gpt52_low = query_model("gpt-5.2", reasoning_effort="low")
print(f"\nModel: {result_gpt52_low['model']}")
print(f"Reasoning Effort: {result_gpt52_low['reasoning_effort']}")
print(f"Input Tokens: {result_gpt52_low['input_tokens']}")
print(f"Reasoning Tokens: {result_gpt52_low['reasoning_tokens']}")
print(f"Output Tokens: {result_gpt52_low['output_tokens']}")
print(f"Total Tokens: {result_gpt52_low['total_tokens']}")
print(f"\nAnswer: {result_gpt52_low['answer'][:500]}...")

### 4.3 GPT-5.2 with Medium Reasoning Effort

In [None]:
print("Testing GPT-5.2 with medium reasoning effort...")
result_gpt52_medium = query_model("gpt-5.2", reasoning_effort="medium")
print(f"\nModel: {result_gpt52_medium['model']}")
print(f"Reasoning Effort: {result_gpt52_medium['reasoning_effort']}")
print(f"Input Tokens: {result_gpt52_medium['input_tokens']}")
print(f"Reasoning Tokens: {result_gpt52_medium['reasoning_tokens']}")
print(f"Output Tokens: {result_gpt52_medium['output_tokens']}")
print(f"Total Tokens: {result_gpt52_medium['total_tokens']}")
print(f"\nAnswer: {result_gpt52_medium['answer'][:500]}...")

### 4.4 GPT-5.2 with High Reasoning Effort

In [None]:
print("Testing GPT-5.2 with high reasoning effort...")
result_gpt52_high = query_model("gpt-5.2", reasoning_effort="high")
print(f"\nModel: {result_gpt52_high['model']}")
print(f"Reasoning Effort: {result_gpt52_high['reasoning_effort']}")
print(f"Input Tokens: {result_gpt52_high['input_tokens']}")
print(f"Reasoning Tokens: {result_gpt52_high['reasoning_tokens']}")
print(f"Output Tokens: {result_gpt52_high['output_tokens']}")
print(f"Total Tokens: {result_gpt52_high['total_tokens']}")
print(f"\nAnswer: {result_gpt52_high['answer'][:500]}...")
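
The four cells above repeat the same call-and-print pattern, so they could be collapsed into a loop. The sketch below is one possible way to do that, not part of the original notebook: `run_all_levels` and `format_result` are hypothetical helper names, and the query function is passed in as an argument so that the helpers can be exercised without calling the API.

```python
from typing import Callable, Dict, List, Sequence

# Effort levels used in this notebook, from cheapest to most thorough
EFFORT_LEVELS = ("minimal", "low", "medium", "high")


def run_all_levels(query_fn: Callable[..., Dict], model_name: str,
                   efforts: Sequence[str] = EFFORT_LEVELS) -> List[Dict]:
    """Call query_fn once per effort level and collect the result dicts."""
    collected = []
    for effort in efforts:
        collected.append(query_fn(model_name, reasoning_effort=effort))
    return collected


def format_result(result: Dict, preview_chars: int = 500) -> str:
    """Render one result dict in the same layout as the cells above."""
    answer = result["answer"]
    # Only append an ellipsis when the answer was actually truncated
    preview = answer[:preview_chars] + ("..." if len(answer) > preview_chars else "")
    return "\n".join([
        f"Model: {result['model']}",
        f"Reasoning Effort: {result['reasoning_effort']}",
        f"Input Tokens: {result['input_tokens']}",
        f"Reasoning Tokens: {result['reasoning_tokens']}",
        f"Output Tokens: {result['output_tokens']}",
        f"Total Tokens: {result['total_tokens']}",
        "",
        f"Answer: {preview}",
    ])

# In the notebook this would be used with the helper defined earlier, e.g.:
# results = run_all_levels(query_model, "gpt-5.2")
# for r in results:
#     print(format_result(r))
```

Passing the query function explicitly keeps the loop independent of the global `client`, which makes it easier to try out with a stub before spending tokens on real requests.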

## 5. Compare Token Usage Across All Reasoning Levels

In [None]:
# Compile all results into a dataframe
results = [
    result_gpt52_minimal,
    result_gpt52_low,
    result_gpt52_medium,
    result_gpt52_high
]

# Create comparison dataframe
df = pd.DataFrame(results)
df_display = df[['model', 'reasoning_effort', 'input_tokens', 'reasoning_tokens', 'output_tokens', 'total_tokens']]

print("\n" + "="*80)
print("TOKEN USAGE COMPARISON - GPT-5.2 ACROSS REASONING LEVELS")
print("="*80)
print(df_display.to_string(index=False))
print("="*80)
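
One way to read the comparison table is to normalize each row against the lowest effort level. The sketch below works on a plain list of result dicts shaped like those returned by `query_model`. The helper name `add_relative_usage` and the sample numbers are hypothetical and purely illustrative, not measured results. It also assumes that reasoning tokens are reported as part of `output_tokens`.

```python
def add_relative_usage(rows, baseline_index=0):
    """Return copies of rows with two derived columns.

    reasoning_share   - fraction of output tokens spent on reasoning
    total_vs_baseline - total tokens relative to the baseline row
    """
    baseline_total = rows[baseline_index]["total_tokens"]
    enriched = []
    for row in rows:
        output = row["output_tokens"]
        enriched.append({
            **row,
            # Guard against division by zero for failed (all-zero) rows
            "reasoning_share": row["reasoning_tokens"] / output if output else 0.0,
            "total_vs_baseline": (row["total_tokens"] / baseline_total
                                  if baseline_total else float("nan")),
        })
    return enriched


# Made-up placeholder numbers, only to show the shape of the output
sample_rows = [
    {"reasoning_effort": "minimal", "input_tokens": 100,
     "reasoning_tokens": 0, "output_tokens": 50, "total_tokens": 150},
    {"reasoning_effort": "high", "input_tokens": 100,
     "reasoning_tokens": 300, "output_tokens": 400, "total_tokens": 500},
]

for row in add_relative_usage(sample_rows):
    print(row["reasoning_effort"],
          f"reasoning_share={row['reasoning_share']:.2f}",
          f"total_vs_baseline={row['total_vs_baseline']:.2f}")
```

In the notebook, `pd.DataFrame(add_relative_usage(results))` would give the same columns alongside the existing ones.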

## 6. Visualize Token Costs

In [None]:
# Create labels for plotting
effort_labels = ['Minimal', 'Low', 'Medium', 'High']
df['label'] = effort_labels

# Create stacked bar chart
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

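# Stacked bar chart showing token breakdown.
# output_tokens already includes reasoning tokens, so subtract them to get
# the visible answer tokens and avoid counting reasoning twice.
x_pos = range(len(df))
visible_output = df['output_tokens'] - df['reasoning_tokens']
ax1.bar(x_pos, df['input_tokens'], label='Input Tokens', color='#4CAF50')
ax1.bar(x_pos, df['reasoning_tokens'], bottom=df['input_tokens'], 
        label='Reasoning Tokens', color='#FF9800')
ax1.bar(x_pos, visible_output, 
        bottom=df['input_tokens'] + df['reasoning_tokens'],
        label='Visible Output Tokens', color='#2196F3')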

ax1.set_xlabel('Reasoning Effort Level', fontsize=12, fontweight='bold')
ax1.set_ylabel('Token Count', fontsize=12, fontweight='bold')
ax1.set_title('GPT-5.2 Token Usage Breakdown by Reasoning Level', fontsize=14, fontweight='bold')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(effort_labels)
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Total tokens comparison
colors = ['#4CAF50', '#8BC34A', '#FF9800', '#FF5722']
ax2.bar(x_pos, df['total_tokens'], color=colors)
ax2.set_xlabel('Reasoning Effort Level', fontsize=12, fontweight='bold')
ax2.set_ylabel('Total Tokens', fontsize=12, fontweight='bold')
ax2.set_title('GPT-5.2 Total Token Usage by Reasoning Level', fontsize=14, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(effort_labels)
ax2.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(df['total_tokens']):
    ax2.text(i, v + max(df['total_tokens']) * 0.02, str(v), 
             ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nVisualization complete!")

## 7. View Full Answers from Each Reasoning Level

In [None]:
for i, result in enumerate(results, 1):
    print(f"\n{'='*80}")
    print(f"RESULT #{i}: GPT-5.2 ({result['reasoning_effort']})")
    print(f"{'='*80}")
    print(f"Input Tokens: {result['input_tokens']}")
    print(f"Reasoning Tokens: {result['reasoning_tokens']}")
    print(f"Output Tokens: {result['output_tokens']}")
    print(f"Total Tokens: {result['total_tokens']}")
    print(f"\nFull Answer:\n{result['answer']}")
    print(f"{'='*80}\n")

## 8. Key Insights

**About the Responses API:**

The OpenAI Responses API (`client.responses.create()`) is the recommended way to use GPT-5.2 and other reasoning models. Key parameters:
- `reasoning={"effort": "minimal|low|medium|high"}` - Controls reasoning depth
- `input=[{"role": "user", "content": "..."}]` - The input messages
- The response includes `output_text` for the answer and `usage.output_tokens_details.reasoning_tokens` for the reasoning token count, which is a subset of `usage.output_tokens`
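
To make the usage fields concrete, the sketch below splits a usage object into input, reasoning, and visible output tokens. It uses a `SimpleNamespace` stand-in with made-up numbers instead of a real API response. The field names mirror the ones accessed in `query_model`, while `split_usage` is a hypothetical helper name. It assumes reasoning tokens are already counted inside `output_tokens`, which is worth confirming against the API reference.

```python
from types import SimpleNamespace


def split_usage(usage):
    """Break a usage object into input, reasoning, and visible output tokens."""
    details = getattr(usage, "output_tokens_details", None)
    # reasoning_tokens may be missing or None, so fall back to 0
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    return {
        "input_tokens": usage.input_tokens,
        "reasoning_tokens": reasoning,
        # Assumption: output_tokens = reasoning tokens + visible answer tokens
        "visible_output_tokens": usage.output_tokens - reasoning,
        "total_tokens": usage.total_tokens,
    }


# Stand-in for response.usage with placeholder values (not real measurements)
fake_usage = SimpleNamespace(
    input_tokens=90,
    output_tokens=260,
    output_tokens_details=SimpleNamespace(reasoning_tokens=200),
    total_tokens=350,
)

print(split_usage(fake_usage))
```

With the real client, the same helper would be called as `split_usage(response.usage)`.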

**Expected Observations:**

1. **Minimal Reasoning** - Fastest response with minimal internal reasoning. May miss nuanced details in complex problems.

2. **Low Reasoning** - Light reasoning overhead, good for straightforward questions.

3. **Medium Reasoning** (default) - Balanced approach between speed and thoroughness.

4. **High Reasoning** - Maximum reasoning effort with highest token usage. Best for complex, multi-step problems.

5. **Token Cost Trade-off**: As reasoning effort increases:
   - More **reasoning tokens** are generated (internal "thinking")
   - Answers tend to be more thorough, and often more accurate on multi-step problems
   - Total token usage, and therefore cost, rises because reasoning tokens are billed as output tokens
   - Response times are longer

6. **The Baseball Question**: This tests spatial reasoning. The box has a big hole in it, so the ball most likely fell out somewhere in transit. Higher reasoning levels should be more likely to catch this detail and answer that the baseball is "somewhere between Houston and New York" rather than "in the box."
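
The token cost trade-off in item 5 can be turned into a rough dollar estimate. The sketch below is a minimal illustration: `estimate_cost_usd` is a hypothetical helper, and the per-million-token prices are placeholders rather than real rates for any model, so substitute the values from the current pricing page. It assumes `output_tokens` already includes reasoning tokens, so they are charged at the output rate.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimate request cost in USD from token counts and per-million prices.

    output_tokens is expected to include reasoning tokens.
    """
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Placeholder prices (USD per 1M tokens), NOT real rates for any model
INPUT_PRICE = 1.00
OUTPUT_PRICE = 10.00

# Made-up token counts for a low-effort and a high-effort run
low_cost = estimate_cost_usd(100, 50, INPUT_PRICE, OUTPUT_PRICE)
high_cost = estimate_cost_usd(100, 400, INPUT_PRICE, OUTPUT_PRICE)

print(f"low effort:  ${low_cost:.6f}")
print(f"high effort: ${high_cost:.6f}")
```

Applied to the notebook's results, each row's `input_tokens` and `output_tokens` would be passed in place of the made-up counts.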