### OpenAI Cookbook: Working with O‑Series Models (o3‑mini)
========================================================

This cookbook demonstrates how to:
• Set up and call an O‑series model with different reasoning efforts.
• Evaluate model responses using sample questions and a public dataset.
• Analyze token usage (total and reasoning tokens), response times, and accuracy.
• Understand best practices for prompting, reasoning effort levels, and response logging.

Before running this code, ensure you have installed the required libraries:
   -  pip install openai --upgrade
   -  pip install pandas
   -  pip install datasets

You should also have your OpenAI API key set in the environment variable:
    OPENAI_API_KEY
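
A missing key otherwise only surfaces when the client is created or the first request is sent, so it can help to check for it up front. The helper below is a minimal sketch; the name `require_api_key` and its `env` parameter are illustrative and not part of the OpenAI SDK.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a clear error if it is missing or blank.

    `require_api_key` is a hypothetical helper for this cookbook, not an SDK function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key
```

Calling `require_api_key()` once before `client = OpenAI()` makes the failure mode explicit without ever printing the key itself.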

In [38]:
import os
import time
import logging
import pandas as pd
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()
MODEL_NAME = "o3-mini"  # Choose your model

In [39]:
# Sample questions and expected answers
sample_questions = [
    {
        "question": "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?",
        "answer": "0.05"  # Expected correct answer in dollars
    },
    {
        "question": "If 5 machines take 5 minutes to make 5 widgets, how long would 100 machines take to make 100 widgets?",
        "answer": "5 minutes"  # Expected correct answer
    },
    {
        "question": "The $n=2$ level of the hydrogen atom is fourfold degenerate (the $2s$ and $2p$ orbitals). Consider a hydrogen atom placed in a constant electric field $E$ directed along the $z$-axis (Stark effect). Using first-order degenerate perturbation theory, determine the energy shifts of the $n=2$ levels to first order in $E$. Identify which states are mixed by the perturbation and find the new eigenstates and their energies. (Neglect fine structure and spin.)",
        "answer": "TBD"
    },
    {
        "question": r"Consider $N$ non-interacting spin-$\frac{1}{2}$ particles (such as paramagnetic atoms with a single unpaired electron) in a uniform magnetic field $B$ at temperature $T$. Each spin can be either aligned with the field (down state energy $+\mu B$) or against the field (up state energy $-\mu B$), where $\mu$ is the magnetic moment (assume $\mu B \ll$ any saturation limits so no other levels). (a) Derive an expression for the total magnetization $M$ of the system as a function of $B$ and $T$. (b) Discuss the behavior of $M$ in the limits of very low temperature and very high temperature, and verify that it is consistent with physical expectations (Curie's law at high $T$, saturation at low $T$).",
        "answer": "TBD"
    }
]


In [40]:
def ask_with_reasoning(question: str, reasoning_level: str = "low"):
    """
    Send a question to the OpenAI model with a given reasoning effort level.
    
    Parameters:
        question (str): The input prompt.
        reasoning_level (str): The reasoning effort level ("low", "medium", or "high").
    
    Returns:
        answer (str): The model's answer.
        total_tokens (int): Total token usage.
        reasoning_tokens (Optional[int]): Number of tokens used for reasoning (if available).
        response_time (float): The time (in seconds) taken for the API call.
    """
    start_time = time.time()
    response = client.chat.completions.create(
        model=MODEL_NAME,
        reasoning_effort=reasoning_level,
        messages=[{"role": "user", "content": question}]
    )
    end_time = time.time()
    answer = response.choices[0].message.content.strip()
    usage = response.usage
    total_tokens = usage.total_tokens
    # completion_tokens_details may be missing on some responses, so read it defensively
    details = getattr(usage, "completion_tokens_details", None)
    reasoning_tokens = getattr(details, "reasoning_tokens", None)
    return answer, total_tokens, reasoning_tokens, (end_time - start_time)


In [41]:
def evaluate_questions(questions, reasoning_levels=("low", "medium", "high"), check_correct=False):
    """
    Evaluate a list of questions using multiple reasoning effort levels.
    
    Parameters:
        questions (list of dict): Each dict should have keys 'question' and optionally 'answer'.
        reasoning_levels (list): List of reasoning levels to test.
        check_correct (bool): If True, a simple heuristic checks if the model's answer
                              matches the expected answer.
    
    Returns:
        DataFrame: Contains results for each question and reasoning level.
    """
    results = []
    for item in questions:
        question_text = item["question"]
        expected = item.get("answer", None)
        for level in reasoning_levels:
            try:
                answer, total_tokens, reasoning_tokens, response_time = ask_with_reasoning(question_text, reasoning_level=level)
                correct = None
                if check_correct and expected is not None:
                    # Normalize for a simple substring match
                    ans_norm = answer.lower().strip()
                    exp_norm = str(expected).lower().strip()
                    correct = (exp_norm in ans_norm or ans_norm in exp_norm)
                results.append({
                    "question": question_text,
                    "expected": expected,
                    "level": level,
                    "model_answer": answer,
                    "correct": correct,
                    "total_tokens": total_tokens,
                    "reasoning_tokens": reasoning_tokens,
                    "response_time": response_time
                })
            except Exception as e:
                logging.error(f"Error processing question '{question_text}' at level '{level}': {e}")
                results.append({
                    "question": question_text,
                    "expected": expected,
                    "level": level,
                    "model_answer": None,
                    "correct": False,
                    "total_tokens": None,
                    "reasoning_tokens": None,
                    "response_time": None
                })
    return pd.DataFrame(results)

In [42]:
print("Evaluating sample questions with different reasoning efforts...")
df_sample = evaluate_questions(sample_questions, check_correct=True)
print(df_sample.head())

Evaluating sample questions with different reasoning efforts...
                                            question   expected   level  \
0  A bat and a ball cost $1.10 in total. The bat ...       0.05     low   
1  A bat and a ball cost $1.10 in total. The bat ...       0.05  medium   
2  A bat and a ball cost $1.10 in total. The bat ...       0.05    high   
3  If 5 machines take 5 minutes to make 5 widgets...  5 minutes     low   
4  If 5 machines take 5 minutes to make 5 widgets...  5 minutes  medium   

                                        model_answer  correct  total_tokens  \
0  Let the cost of the ball be x dollars. Since t...     True           179   
1  Let the cost of the ball be x dollars. Then th...     True           427   
2  Let's denote the cost of the ball as x dollars...     True           622   
3  The key is to understand the production rate o...     True           190   
4  If 5 machines take 5 minutes to produce 5 widg...     True           290   


In [43]:
df_sample

| | question | expected | level | model_answer | correct | total_tokens | reasoning_tokens | response_time |
|---|---|---|---|---|---|---|---|---|
| 0 | A bat and a ball cost $1.10 in total. The bat ... | 0.05 | low | Let the cost of the ball be x dollars. Since t... | True | 179 | 0 | 3.968576 |
| 1 | A bat and a ball cost $1.10 in total. The bat ... | 0.05 | medium | Let the cost of the ball be x dollars. Then th... | True | 427 | 256 | 2.51918 |
| 2 | A bat and a ball cost $1.10 in total. The bat ... | 0.05 | high | Let's denote the cost of the ball as x dollars... | True | 622 | 448 | 10.968315 |
| 3 | If 5 machines take 5 minutes to make 5 widgets... | 5 minutes | low | The key is to understand the production rate o... | True | 190 | 64 | 1.507613 |
| 4 | If 5 machines take 5 minutes to make 5 widgets... | 5 minutes | medium | If 5 machines take 5 minutes to produce 5 widg... | True | 290 | 192 | 2.657307 |
| 5 | If 5 machines take 5 minutes to make 5 widgets... | 5 minutes | high | If 5 machines take 5 minutes to make 5 widgets... | True | 683 | 576 | 4.773492 |
| 6 | The $n=2$ level of the hydrogen atom is fourfo... | TBD | low | We wish to find the first‐order energy shifts ... | False | 2463 | 576 | 12.124599 |
| 7 | The $n=2$ level of the hydrogen atom is fourfo... | TBD | medium | We start by noting that the four n = 2 states ... | False | 3018 | 1536 | 17.955862 |
| 8 | The $n=2$ level of the hydrogen atom is fourfo... | TBD | high | We wish to find the first‐order energy shifts ... | False | 8662 | 6976 | 49.791513 |
| 9 | Consider $N$ non-interacting spin-$rac{1}{2}$... | TBD | low | We start by noting that each spin-½ particle i... | False | 1169 | 128 | 7.308833 |


In [45]:
# Aggregate and summarize token usage and response time by reasoning level
summary = df_sample.groupby("level").agg({
    "total_tokens": "mean",
    "reasoning_tokens": "mean",
    "response_time": "mean"
}).reset_index()

summary.rename(columns={
    "total_tokens": "avg_total_tokens", 
    "reasoning_tokens": "avg_reasoning_tokens",
    "response_time": "avg_response_time"
}, inplace=True)

print("\nSummary for sample questions:")
summary


Summary for sample questions:


| | level | avg_total_tokens | avg_reasoning_tokens | avg_response_time |
|---|---|---|---|---|
| 0 | high | 3123.75 | 2352.0 | 21.228043 |
| 1 | low | 1000.25 | 192.0 | 6.227405 |
| 2 | medium | 1478.0 | 768.0 | 8.664308 |


#### -------------------------------
#### Part 2: Evaluate Dataset Questions (ARC-Challenge)
#### -------------------------------

In [69]:
!pip install datasets
from datasets import load_dataset

ds = load_dataset("allenai/ai2_arc", "ARC-Challenge")


In [70]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 1119
    })
    test: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 1172
    })
    validation: Dataset({
        features: ['id', 'question', 'choices', 'answerKey'],
        num_rows: 299
    })
})

In [88]:
df = pd.DataFrame(ds["train"])
df.head()


| | id | question | choices | answerKey |
|---|---|---|---|---|
| 0 | Mercury_SC_415702 | George wants to warm his hands quickly by rubb... | {'text': ['dry palms', 'wet palms', 'palms cov... | A |
| 1 | MCAS_2009_5_6516 | Which of the following statements best explain... | {'text': ['The refrigerator door is smooth.', ... | B |
| 2 | Mercury_7233695 | A fold observed in layers of sedimentary rock ... | {'text': ['cooling of flowing magma.', 'conver... | B |
| 3 | Mercury_7041615 | Which of these do scientists offer as the most... | {'text': ['worldwide disease', 'global mountai... | D |
| 4 | Mercury_7041860 | A boat is acted on by a river current flowing ... | {'text': ['west', 'east', 'north', 'south'], '... | B |


In [92]:
results = []  # to accumulate results for each question and level

for item in range(2):
    q_text = str(df.iloc[item].question) + "\n" + "choices: " + str(df.iloc[item].choices)
    expected = df.iloc[item].answerKey
    for level in ["low", "medium", "high"]:
        try:
            # ask_with_reasoning returns four values; the response time is not recorded here
            answer, total_tokens, reasoning_tokens, _ = ask_with_reasoning(q_text, reasoning_level=level)
            # Normalize answer and expected for comparison
            ans_norm = answer.lower().strip()
            exp_norm = str(expected).lower().strip()
            # Simple heuristic: count the answer as correct if the expected key appears in it.
            # Caveat: ARC keys are single letters such as "A", so after lowercasing this
            # substring test matches almost any English text and overstates accuracy.
            correct = exp_norm in ans_norm or ans_norm in exp_norm
            # Record the result
            results.append({
                "question": q_text,
                "level": level,
                "model_answer": answer,
                "correct": correct,
                "total_tokens": total_tokens,
                "reasoning_tokens": reasoning_tokens
            })
        except Exception as e:
            # Catch any failure (API errors included) so one bad call does not stop the loop
            print(f"Error processing question '{q_text}' at level '{level}': {e}")
            results.append({
                "question": q_text,
                "level": level,
                "model_answer": None,
                "correct": False,
                "total_tokens": None,
                "reasoning_tokens": None
            })

# Convert results to DataFrame for analysis
df_results = pd.DataFrame(results)
df_results.head()





| | question | level | model_answer | correct | total_tokens | reasoning_tokens |
|---|---|---|---|---|---|---|
| 0 | George wants to warm his hands quickly by rubb... | low | Rubbing dry skin creates a higher coefficient ... | True | 242 | 64 |
| 1 | George wants to warm his hands quickly by rubb... | medium | The answer is A: dry palms.\n\nExplanation: Wh... | True | 473 | 320 |
| 2 | George wants to warm his hands quickly by rubb... | high | Rubbing two surfaces together produces heat du... | True | 936 | 768 |
| 3 | Which of the following statements best explain... | low | The best explanation is that the refrigerator ... | True | 132 | 0 |
| 4 | Which of the following statements best explain... | medium | The best explanation is: B. The refrigerator d... | True | 205 | 64 |


In [93]:
df_results

| | question | level | model_answer | correct | total_tokens | reasoning_tokens |
|---|---|---|---|---|---|---|
| 0 | George wants to warm his hands quickly by rubb... | low | Rubbing dry skin creates a higher coefficient ... | True | 242 | 64 |
| 1 | George wants to warm his hands quickly by rubb... | medium | The answer is A: dry palms.\n\nExplanation: Wh... | True | 473 | 320 |
| 2 | George wants to warm his hands quickly by rubb... | high | Rubbing two surfaces together produces heat du... | True | 936 | 768 |
| 3 | Which of the following statements best explain... | low | The best explanation is that the refrigerator ... | True | 132 | 0 |
| 4 | Which of the following statements best explain... | medium | The best explanation is: B. The refrigerator d... | True | 205 | 64 |
| 5 | Which of the following statements best explain... | high | The correct answer is B: The refrigerator door... | True | 459 | 320 |


In [95]:
# aggregate the results by reasoning level
df_results.groupby("level").agg({
    "correct": "mean",
    "total_tokens": "mean",
    "reasoning_tokens": "mean"
}).reset_index()


| | level | correct | total_tokens | reasoning_tokens |
|---|---|---|---|---|
| 0 | high | 1.0 | 697.5 | 544.0 |
| 1 | low | 1.0 | 187.0 | 32.0 |
| 2 | medium | 1.0 | 339.0 | 192.0 |
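
The accuracy column above comes from a lowercase substring test, and a one-letter key such as "a" or "b" occurs in almost any English answer, so a score of 1.0 says little. A stricter check can pull out the choice letter the model actually names. The function below is a sketch under assumptions: `extract_choice_letter` is a hypothetical helper, and its two patterns only cover phrasings like "answer is B", "Answer: (C)", or "B." seen in the outputs above.

```python
import re


def extract_choice_letter(answer, labels="ABCDE"):
    """Return the choice label the answer explicitly names, or None if none is found.

    Hypothetical helper: the patterns are heuristics, not an exhaustive parser.
    """
    cls = "[" + re.escape(labels) + "]"
    # 1) Explicit phrasing: "answer is B", "Answer: (C)". Only the keywords are
    #    case-insensitive; the label itself must be uppercase.
    explicit = re.search(
        r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?(" + cls + r")\)?(?![A-Za-z])", answer
    )
    if explicit:
        return explicit.group(1)
    # 2) A standalone label followed by ".", ":" or ")", e.g. "is: B. The ..."
    marked = re.search(r"(?:^|[\s(])(" + cls + r")[.:)]", answer)
    return marked.group(1) if marked else None


print(extract_choice_letter("The answer is A: dry palms."))
print(extract_choice_letter("The best explanation is: B. The refrigerator door is..."))
print(extract_choice_letter("Rubbing dry skin creates a higher coefficient of friction."))
```

With this helper, `correct = extract_choice_letter(answer) == expected` marks free-text answers that never name a letter as unscored (`None`) instead of silently counting them as correct. Asking the model to finish with a line such as `Answer: <letter>` makes the extraction more reliable.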


### Prompting practices:

 - Keep your prompt clear. For example, include context, choices (if applicable), and the question.
 - Use a consistent format to help the model interpret the query.
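
As one way to apply both points, the sketch below builds every multiple-choice prompt from the same template instead of concatenating `str(choices)`. It assumes the ARC `choices` field is a dict with parallel `text` and `label` lists; the function name `format_arc_prompt` and the closing instruction line are illustrative choices, not something the dataset or the API prescribes.

```python
def format_arc_prompt(question, choices):
    """Build a consistently formatted multiple-choice prompt.

    Assumes `choices` looks like {"text": [...], "label": [...]} with equal lengths.
    """
    lines = ["Question: " + question.strip(), "Choices:"]
    for label, text in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {text}")
    # A fixed answer format makes the reply easier to score automatically
    lines.append("Reply with the letter of the best choice on the last line, as 'Answer: <letter>'.")
    return "\n".join(lines)


example = format_arc_prompt(
    "Which direction is the boat pushed?",  # made-up question for illustration
    {"text": ["west", "east", "north", "south"], "label": ["A", "B", "C", "D"]},
)
print(example)
```

Keeping the template in one function means every question reaches the model in the same layout, which makes differences between reasoning levels easier to attribute to the level and not to the prompt.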


### Reasoning Effort Levels:

 - "low": Minimal reasoning. Fast response time with lower token usage.
 - "medium": A balance between speed and detailed reasoning.
 - "high": Extensive reasoning, potentially yielding more accurate or nuanced answers, but at the cost of higher token usage and slower responses.

### Response Time and Token Usage:

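 - Monitor response time to gauge performance. In the sample run above, the average response time rose from about 6.2 s at low effort to about 21.2 s at high effort, so a lower reasoning effort may be preferable for high-throughput applications.
 - Analyze total_tokens and reasoning_tokens to optimize cost and model behavior. In the same run, average reasoning tokens grew from 192 at low effort to 2352 at high effort.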
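
One simple way to read these two counters together is the share of each response spent on reasoning. The sketch below applies that ratio to the averages from the sample-question summary earlier in this cookbook; the helper name `reasoning_share` is an assumption made for illustration.

```python
def reasoning_share(total_tokens, reasoning_tokens):
    """Return reasoning_tokens / total_tokens, or None when either value is unavailable."""
    if not total_tokens or reasoning_tokens is None:
        return None
    return reasoning_tokens / total_tokens


# (avg_total_tokens, avg_reasoning_tokens) copied from the sample-question summary
summary_rows = {
    "low": (1000.25, 192.0),
    "medium": (1478.0, 768.0),
    "high": (3123.75, 2352.0),
}

for level, (total, reasoning) in summary_rows.items():
    share = reasoning_share(total, reasoning)
    print(f"{level:>6}: {share:.1%} of tokens spent on reasoning")
```

On those averages the share climbs from roughly a fifth of all tokens at low effort to roughly three quarters at high effort, which is why cost grows faster than the visible answer length suggests.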

### Logging and Error Handling:

 - Use try/except blocks to catch issues when processing multiple queries.
 - Logging errors can help you diagnose issues in real-world applications.
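
These two points can be combined in a small retry wrapper that logs each failure before trying again. This is a sketch only: `call_with_retries`, the logger name, and the backoff values are assumptions, and it retries on any exception, whereas a production version would likely retry only on transient errors such as rate limits or timeouts.

```python
import logging
import time

logger = logging.getLogger("o_series_eval")  # hypothetical logger name


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff and logging each failure.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

For example, `call_with_retries(ask_with_reasoning, question_text, reasoning_level="low")` would wrap the helper defined earlier. The `sleep` parameter exists so the delay can be replaced in tests.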
