# The Efficacy of Reasoning in AI Models
This notebook is designed to investigate the following research questions:
1. Do all tasks benefit from reasoning?
2. Do different models show varying benefits from reasoning?
3. Do different reasoning approaches (e.g., Chain of Thought, Program of Thought) yield different benefits?
4. Is there a task-approach fit? Do certain tasks benefit more from specific reasoning methods?
5. What is the cost-benefit tradeoff for each approach and task?
6. Can we predict the need for reasoning based on the input prompt alone?

## How to Use This Notebook:

**Setup**: Run the first code cell to install necessary libraries and enter your API keys.

**Configuration**: Use the interactive widgets in the "Experiment Configuration" section to select the provider, model, hyperparameters, and reasoning approach.

**Data**: The notebook loads a placeholder dataset. Replace it with your actual Hugging Face dataset and perform any other preprocessing/concatenation.

**Pipeline**: Run the "Core Inference Logic" cell to set up the pipeline. No changes required.

**Execution**: Run the "Execute Experiment" cell to start the process.

**Results**: The results will be collected and displayed in a table, which you can then download as a CSV file.

## Setup: Installing Libraries and API Keys
The cell below installs the Python libraries needed for Hugging Face and the API providers, then prompts you to enter your API keys.

In [None]:
# @title 1. Setup: Install Libraries and Authenticate
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q openai
!pip install -q cohere
!pip install -q anthropic
!pip install -q ipywidgets
!pip install -q tqdm
!pip install -q huggingface_hub
!pip install -q datasets  # required by the "Load Dataset" cell

import os
import time
import pandas as pd
from getpass import getpass
import ipywidgets as widgets
from IPython.display import display, clear_output

api_keys = {}
print("Please enter your API keys. Press Enter if you don't have a key for a specific service.")
api_keys['anthropic'] = getpass('Enter your Anthropic API Key: ')
api_keys['cohere'] = getpass('Enter your Cohere API Key: ')
api_keys['openrouter'] = getpass('Enter your OpenRouter API Key: ')
api_keys['huggingface'] = getpass('Enter your Hugging Face API Key: ')
print("Setup Complete!")

## Experiment Configuration
Use the dropdown menus and sliders below to configure the experiment. The list of models will dynamically update based on the provider you choose.

In [None]:
# @title 2. Experiment Configuration
MODELS = {
    'huggingface': ["google/gemma-2-2b-it","mistralai/Mistral-7B-Instruct-v0.2","meta-llama/Meta-Llama-3-8B-Instruct"],
    'anthropic': ["claude-opus-4-1-20250805","claude-sonnet-4-20250514","claude-3-5-haiku-20241022"],
    'cohere': ["command-r-plus","command-r","command-light"],
    'openrouter': ["openai/gpt-5-chat","openai/gpt-5-mini","openai/gpt-oss-120b","google/gemini-2.5-flash-lite"]
}
provider_widget = widgets.Dropdown(
    options=['huggingface', 'anthropic', 'cohere', 'openrouter'],
    value='openrouter',
    description='Provider:',
    disabled=False,
)
model_widget = widgets.Dropdown(
    options=MODELS[provider_widget.value],
    description='Model:',
    disabled=False,
)
temp_widget = widgets.FloatSlider(value=0.5, min=0.0, max=2.0, step=0.1, description='Temperature:')
max_tokens_widget = widgets.IntSlider(value=512, min=64, max=4096, step=64, description='Max Tokens:')
top_p_widget = widgets.FloatSlider(value=0.9, min=0.0, max=1.0, step=0.05, description='Top P:')
reasoning_widget = widgets.Dropdown(
    options=['None','Chain-of-Thought (CoT)','Program-of-Thought (PoT)','Reasoning-as-Planning (RAP)','Reflection','Chain-of-Verification (CoVe)','Skeleton-of-Thought (SoT)','Tree-of-Thought (ToT)','Graph-of-Thought (GoT)','ReWOO','Buffer-of-Thoughts (BoT)'],
    value='None',
    description='Reasoning:',
    disabled=False,
    style={'description_width': 'initial'}
)
def update_models_list(change):
    """Updates the model dropdown when the provider changes."""
    model_widget.options = MODELS[change.new]

provider_widget.observe(update_models_list, names='value')
print("⚙️ Please configure your experiment below:")
display(widgets.VBox([provider_widget, model_widget, temp_widget, max_tokens_widget, top_p_widget, reasoning_widget]))

⚙️ Please configure your experiment below:



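## Data Loading
Set `user_name` and `dataset_name` to the owner and name of your dataset on the Hugging Face Hub. The dataset should ideally contain an `input` column (the problem or question); when it is missing, the cell below copies the `question` column into `input`. An expected-output column (named `answer` in the example dataset) is optional and used only for evaluation.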

In [None]:
# @title 3. Load Dataset
from datasets import load_dataset
import pandas as pd
user_name = "MrLight"
dataset_name = "bbeh-eval"
try:
    hf_dataset = load_dataset(f'{user_name}/{dataset_name}', split='train').to_pandas()
    print(f"Successfully loaded {dataset_name} from Hugging Face.")
    if 'input' not in hf_dataset.columns:
        # Fall back to the 'question' column when no 'input' column exists.
        print("Warning: no 'input' column found; copying 'question' into 'input'.")
        hf_dataset['input'] = hf_dataset['question']
    # Keep only the first 20 rows so a trial run stays short.
    hf_dataset = hf_dataset.head(20)
except Exception as e:
    print(f"Could not load your dataset. Using a placeholder instead. Error: {e}")
    # Minimal one-row placeholder so the remaining cells still run.
    hf_dataset = pd.DataFrame({'input': ["What is 17 + 25?"], 'answer': ["42"]})
print("Dataset Preview:")
display(hf_dataset.head())

Successfully loaded bbeh-eval from Hugging Face.
Dataset Preview:


| | question | answer | task | input |
|---|---|---|---|---|
| 0 | I had a collection of 42 weird items that went... | 43 | object_properties | I had a collection of 42 weird items that went... |
| 1 | I had a collection of 31 weird items that went... | 15 | object_properties | I had a collection of 31 weird items that went... |
| 2 | I had a collection of 37 weird items that went... | 11 | object_properties | I had a collection of 37 weird items that went... |
| 3 | I had a collection of 45 weird items that went... | 12 | object_properties | I had a collection of 45 weird items that went... |
| 4 | I had a collection of 47 weird items that went... | 37 | object_properties | I had a collection of 47 weird items that went... |
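
The preview shows a `task` column, and the first rows all belong to one task (`object_properties`). Taking the first rows of the dataset may therefore cover only a single task, which limits what can be said about whether all tasks benefit from reasoning. The following is a minimal sketch of drawing a small sample from every task instead. The helper `sample_per_task` and its parameters are hypothetical names introduced here, not part of the notebook, and the toy frame is illustrative data only.

```python
import pandas as pd

def sample_per_task(df: pd.DataFrame, n_per_task: int = 5,
                    task_col: str = "task", seed: int = 0) -> pd.DataFrame:
    """Return up to n_per_task randomly chosen rows for each distinct task."""
    parts = [
        group.sample(n=min(n_per_task, len(group)), random_state=seed)
        for _, group in df.groupby(task_col, sort=False)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy frame standing in for hf_dataset (illustrative rows, not benchmark data).
demo = pd.DataFrame({
    "input": [f"q{i}" for i in range(6)],
    "task": ["a", "a", "a", "a", "b", "b"],
})
subset = sample_per_task(demo, n_per_task=2)
print(subset["task"].value_counts().to_dict())
```

In the notebook this could be applied as `hf_dataset = sample_per_task(hf_dataset, n_per_task=5)`, assuming the loaded dataset has a `task` column as in the preview.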


## Core Inference Logic
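The cell below defines `run_inference`, which sends each prompt to the selected provider's API or to a locally loaded Hugging Face model, and `execute_reasoning_pipeline`, which wraps the prompt in the chosen reasoning approach. No changes are required. Note that `cost` is initialized to 0.0 and never updated, so the `cost_usd` column in the results is always zero.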

In [None]:
# @title 4. Core Inference Functions
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login
import re
import anthropic
import cohere
from openai import OpenAI
hf_model_loaded = None
hf_tokenizer_loaded = None
def run_inference(provider, model_name, prompt, temp, max_tokens, top_p):
    global hf_model_loaded, hf_tokenizer_loaded
    start_time = time.time()
    output, cost = "", 0.0
    try:
        if provider == 'huggingface':
            if hf_model_loaded is None or hf_model_loaded.config.name_or_path != model_name:
                if api_keys.get('huggingface'): login(api_keys['huggingface'])
                print(f"\nLoading Hugging Face model: {model_name}. This may take a while...")
                quantization_config = BitsAndBytesConfig(load_in_4bit=True)
                hf_tokenizer_loaded = AutoTokenizer.from_pretrained(model_name)
                hf_model_loaded = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, device_map="auto")
                print("Model loaded.")
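            # Follow the device chosen by device_map="auto" instead of assuming CUDA.
            inputs = hf_tokenizer_loaded(prompt, return_tensors="pt").to(hf_model_loaded.device)
            input_ids_len = inputs['input_ids'].shape[1]
            # Sample only when temperature > 0; otherwise decode greedily without sampling parameters.
            sampling_kwargs = {"do_sample": True, "temperature": temp, "top_p": top_p} if temp > 0 else {"do_sample": False}
            outputs = hf_model_loaded.generate(**inputs, max_new_tokens=max_tokens, **sampling_kwargs)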
            generated_tokens = outputs[0][input_ids_len:]
            output = hf_tokenizer_loaded.decode(generated_tokens, skip_special_tokens=True)
        elif provider == 'anthropic':
            client = anthropic.Anthropic(api_key=api_keys['anthropic'])
            response = client.messages.create(model=model_name, messages=[{"role": "user", "content": prompt}], temperature=temp, max_tokens=max_tokens, top_p=top_p)
            output = response.content[0].text
        elif provider == 'cohere':
            co = cohere.Client(api_keys['cohere'])
            response = co.chat(model=model_name, message=prompt, temperature=temp, max_tokens=max_tokens, p=top_p)
            output = response.text
        elif provider == 'openrouter':
            client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_keys['openrouter'])
            response = client.chat.completions.create(model=model_name, messages=[{"role": "user", "content": prompt}], temperature=temp, max_tokens=max_tokens, top_p=top_p)
            output = response.choices[0].message.content
    except Exception as e:
        output = f"ERROR: {str(e)}"
    duration = time.time() - start_time
    return output, duration, cost

def execute_reasoning_pipeline(original_prompt, reasoning_type, inference_config):
    """
    Manages the full reasoning process for all 10 techniques.
    """
    total_duration, total_cost = 0, 0
    reasoning_trace = ""
    def call_inference(prompt):
        nonlocal total_duration, total_cost, reasoning_trace
        reasoning_trace += f"----------\nPROMPT:\n{prompt}\n----------\n"
        output, duration, cost = run_inference(**inference_config, prompt=prompt)
        total_duration += duration
        total_cost += cost
        reasoning_trace += f"RESPONSE ({duration:.2f}s):\n{output}\n\n"
        return output
    # 1. No Reasoning
    if reasoning_type == 'None':
        final_output = call_inference(original_prompt)
        return final_output, total_duration, total_cost, reasoning_trace
    # 2. Chain-of-Thought (CoT)
    elif reasoning_type == 'Chain-of-Thought (CoT)':
        prompt = f"Question: {original_prompt}\n\nLet's think step by step."
        final_output = call_inference(prompt)
        return final_output, total_duration, total_cost, reasoning_trace
    # 3. Program-of-Thought (PoT)
    elif reasoning_type == 'Program-of-Thought (PoT)':
        prompt = f"# Q: {original_prompt}\n# A: Write a python program to solve this.\n\nimport math\n"
        final_output = call_inference(prompt)
        return final_output, total_duration, total_cost, reasoning_trace
    # 4. Reasoning-as-Planning (RAP)
    elif reasoning_type == 'Reasoning-as-Planning (RAP)':
        # You might need to add domain details (Initial State, Goal, Actions) to your dataset's input column.
        prompt = f"""You are a planning agent. Analyze the following problem and generate a sequence of actions to solve it.
        Problem: '{original_prompt}'
        Provide the solution as a numbered list of actions (the plan).
        Plan:"""
        final_output = call_inference(prompt)
        return final_output, total_duration, total_cost, reasoning_trace
    # 5. Reflection
    elif reasoning_type == 'Reflection':
        draft = call_inference(f"Provide a detailed answer to the question: {original_prompt}")
        critique = call_inference(f"You are an expert critic. Review the following answer for the question '{original_prompt}'.\n\nAnswer: '{draft}'\n\nIdentify errors, omissions, or areas for improvement. Provide a constructive critique.")
        final_output = call_inference(f"Original Question: {original_prompt}\nDraft Answer: '{draft}'\nCritique: '{critique}'\n\nBased on the critique, provide a final, improved answer.")
        return final_output, total_duration, total_cost, reasoning_trace
    # 6. Chain-of-Verification (CoVe)
    elif reasoning_type == 'Chain-of-Verification (CoVe)':
        draft = call_inference(f"Provide a direct answer to the question: '{original_prompt}'")
        questions = call_inference(f"Based on this statement: '{draft}', generate a list of verification questions to check its factual accuracy.")
        answers = call_inference(f"Please answer these questions concisely:\n{questions}")
        final_output = call_inference(f"Original Question: {original_prompt}\nInitial Answer: '{draft}'\nVerification Q&A:\n{answers}\n\nBased on the verification, provide a final, corrected answer.")
        return final_output, total_duration, total_cost, reasoning_trace
    # 7. Skeleton-of-Thought (SoT)
    elif reasoning_type == 'Skeleton-of-Thought (SoT)':
        skeleton = call_inference(f"Generate a concise, numbered list outline (a skeleton) for a comprehensive answer to the question: '{original_prompt}'")
        outline_points = [line for line in skeleton.strip().split('\n') if re.match(r'^\s*(\d+\.|\*|-)\s+', line.strip())]
        if not outline_points: outline_points = [skeleton]
        full_answer_parts = []
        reasoning_trace += "--- Expanding Skeleton ---\n"
        for point in outline_points:
            expanded_point = call_inference(f"In the context of answering '{original_prompt}', please expand on this point in detail:\n\n{point}")
            full_answer_parts.append(expanded_point)
        final_output = "\n\n".join(full_answer_parts)
        return final_output, total_duration, total_cost, reasoning_trace
    # 8. Tree-of-Thought (ToT)
    elif reasoning_type == 'Tree-of-Thought (ToT)':
        thoughts = call_inference(f"For the question '{original_prompt}', generate three distinct and promising initial approaches to solving it. Label them Approach 1, Approach 2, and Approach 3.")
        evaluation = call_inference(f"You are an expert evaluator. Critique the following approaches for solving '{original_prompt}'. Which one is most likely to succeed and why?\n\n{thoughts}")
        final_output = call_inference(f"Original Question: '{original_prompt}'\n\nBased on this expert evaluation: '{evaluation}', expand the most promising approach into a full, detailed final answer.")
        return final_output, total_duration, total_cost, reasoning_trace
    # 9. Graph-of-Thought (GoT)
    elif reasoning_type == 'Graph-of-Thought (GoT)':
        nodes = call_inference(f"To answer '{original_prompt}', please generate several distinct ideas or pieces of information that could be relevant. These are nodes in a knowledge graph.")
        synthesis = call_inference(f"Here are several ideas related to '{original_prompt}':\n\n{nodes}\n\nFind connections, resolve contradictions, and merge these ideas into a single, coherent, synthesized paragraph.")
        final_output = call_inference(f"Based on the following synthesized understanding:\n\n'{synthesis}'\n\nProvide a comprehensive final answer to the question: '{original_prompt}'.")
        return final_output, total_duration, total_cost, reasoning_trace
    # 10. ReWOO
    elif reasoning_type == 'ReWOO':
        plan = call_inference(f"To answer '{original_prompt}', create a plan. Break the problem down and identify steps requiring external information (use placeholders like [Search(query)]) or calculations (use [Calculate(expression)]).")
        # In a real system, you'd execute the tool calls. Here, we ask the LLM to simulate the results.
        simulated_results = call_inference(f"Imagine you are a tool executor. For the following plan, predict the results of the tool calls.\n\nPlan:\n{plan}")
        final_output = call_inference(f"Using the following plan and simulated tool results, construct the final answer to the original question: '{original_prompt}'.\n\nPlan:\n{plan}\n\nResults:\n{simulated_results}")
        return final_output, total_duration, total_cost, reasoning_trace
    # 11. Buffer-of-Thoughts (BoT)
    elif reasoning_type == 'Buffer-of-Thoughts (BoT)':
        print("Executing BoT Pipeline...")
        step1 = call_inference(f"Let's start solving: '{original_prompt}'. Perform one step of reasoning and then summarize your key findings in a '[Thought Buffer]'.")
        step2 = call_inference(f"Previous reasoning:\n'{step1}'\n\nContinue from here. Perform the next step of reasoning and update the buffer with any new, important findings.")
        final_output = call_inference(f"You have performed a multi-step analysis summarized below:\n\n{step2}\n\nBased on your complete reasoning and the final thought buffer, provide the final, comprehensive answer to: '{original_prompt}'.")
        return final_output, total_duration, total_cost, reasoning_trace
    else:
        return "ERROR: Selected reasoning approach is not implemented.", 0, 0, ""
print("Core logic and all reasoning pipelines are ready.")

Core logic and all reasoning pipelines are ready.
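
Because `run_inference` never updates `cost`, the cost-benefit question cannot be answered from the `cost_usd` column as it stands. One possible approach, sketched below, is to estimate cost from token counts. The price table and the helper `estimate_cost` are hypothetical; the prices are placeholders, not real provider rates. Token counts would have to come from each provider's response (for example `response.usage.input_tokens` and `response.usage.output_tokens` for Anthropic, or `response.usage.prompt_tokens` and `response.usage.completion_tokens` for OpenAI-compatible endpoints).

```python
# Hypothetical USD prices per one million tokens: (input, output).
# These numbers are placeholders, not real provider prices.
PRICE_PER_MILLION_TOKENS = {
    "example-model": (1.00, 2.00),
}

def estimate_cost(model_name: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call; returns 0.0 when no price is known."""
    prices = PRICE_PER_MILLION_TOKENS.get(model_name)
    if prices is None:
        return 0.0
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

print(estimate_cost("example-model", input_tokens=1_500, output_tokens=500))
```

Multi-call approaches such as Reflection or Tree-of-Thought issue several requests per question, so summing per-call estimates (as `call_inference` already does for `total_cost`) would give the cost per question.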


## Execute Experiment
Running the cell below will start the experiment. It will iterate through each row of your dataset, apply the configured settings, and collect the results.

In [None]:
# @title 5. Run the Experiment
from tqdm.auto import tqdm
inference_config = {
    "provider": provider_widget.value,
    "model_name": model_widget.value,
    "temp": temp_widget.value,
    "max_tokens": max_tokens_widget.value,
    "top_p": top_p_widget.value,
}
reasoning_approach = reasoning_widget.value

results_list = []
print(f"Starting experiment with the following configuration:")
print(f"Reasoning: {reasoning_approach}")
print(f"Inference: {inference_config}")

for index, row in tqdm(hf_dataset.iterrows(), total=len(hf_dataset), desc=f"Executing {reasoning_approach} Pipeline"):
    if reasoning_approach.startswith('---'):
        continue
    original_prompt = row['input']
    model_output, duration, cost, trace = execute_reasoning_pipeline(
        original_prompt=original_prompt,
        reasoning_type=reasoning_approach,
        inference_config=inference_config
    )
    result_row = {
        "input": original_prompt,
        "model_output": model_output,
        "reasoning_trace": trace,
        "execution_time_s": round(duration, 2),
        "cost_usd": cost,
        "reasoning_approach": reasoning_approach,
        **inference_config
    }
    results_list.append(result_row)

results_df = pd.DataFrame(results_list)
print("\n\n--- Experiment Complete ---")
display(results_df)

Starting experiment with the following configuration:
Reasoning: Tree-of-Thought (ToT)
Inference: {'provider': 'cohere', 'model_name': 'command-r', 'temp': 0.5, 'max_tokens': 512, 'top_p': 0.9}





--- Experiment Complete ---


| | input | model_output | reasoning_trace | execution_time_s | cost_usd | reasoning_approach | provider | model_name | temp | max_tokens | top_p |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I had a collection of 42 weird items that went... | Here's a detailed breakdown of the items in yo... | ----------\nPROMPT:\nFor the question 'I had a... | 10.01 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 1 | I had a collection of 31 weird items that went... | ### Collection Overview:\nOur journey begins w... | ----------\nPROMPT:\nFor the question 'I had a... | 10.18 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 2 | I had a collection of 37 weird items that went... | Let's walk through the entire process, using t... | ----------\nPROMPT:\nFor the question 'I had a... | 9.93 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 3 | I had a collection of 45 weird items that went... | ### Approach 3: Initial Collection Analysis\n\... | ----------\nPROMPT:\nFor the question 'I had a... | 9.69 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 4 | I had a collection of 47 weird items that went... | Here's a detailed plan and execution of using ... | ----------\nPROMPT:\nFor the question 'I had a... | 9.43 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 5 | I had a collection of 50 weird items that went... | Let's walk through the entire life cycle of yo... | ----------\nPROMPT:\nFor the question 'I had a... | 12.12 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 6 | I had a collection of 37 weird items that went... | ### Approach 2: Use Attribute Counters\n\nWe'l... | ----------\nPROMPT:\nFor the question 'I had a... | 10.22 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 7 | I had a collection of 34 weird items that went... | Here's a step-by-step breakdown of the process... | ----------\nPROMPT:\nFor the question 'I had a... | 10.05 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 8 | I had a collection of 36 weird items that went... | Final Answer: Using a spreadsheet to tally the... | ----------\nPROMPT:\nFor the question 'I had a... | 8.7 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
| 9 | I had a collection of 30 weird items that went... | Here's a step-by-step breakdown of how the spr... | ----------\nPROMPT:\nFor the question 'I had a... | 8.79 | 0.0 | Tree-of-Thought (ToT) | cohere | command-r | 0.5 | 512 | 0.9 |
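
The results table records the model output but not whether it is correct, so comparing reasoning approaches needs a scoring step. The sketch below is one rough way to do this, assuming the expected answers are numeric (as in the `answer` column of the preview) and that the last number in the output is the model's final answer. Both assumptions will fail for free-text answers. The helpers `extract_last_number` and `score_results` are hypothetical names, and the toy rows are illustrative only.

```python
import re
import pandas as pd

def extract_last_number(text):
    """Return the last integer or decimal found in text as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", str(text))
    return matches[-1] if matches else None

def score_results(results: pd.DataFrame, expected: pd.Series) -> pd.DataFrame:
    """Add 'expected', 'predicted', 'is_error', and 'correct' columns to a copy of results."""
    scored = results.copy()
    scored["expected"] = expected.astype(str).str.strip().values
    scored["predicted"] = scored["model_output"].map(extract_last_number)
    # run_inference reports failures as strings starting with "ERROR:".
    scored["is_error"] = scored["model_output"].astype(str).str.startswith("ERROR:")
    scored["correct"] = (scored["predicted"] == scored["expected"]) & ~scored["is_error"]
    return scored

# Toy rows standing in for results_df and the dataset's answer column.
toy_results = pd.DataFrame({"model_output": [
    "The answer is 43.",
    "ERROR: request timed out",
    "It could be 10, but I think 12",
]})
toy_expected = pd.Series(["43", "15", "11"])
scored = score_results(toy_results, toy_expected)
print(scored[["expected", "predicted", "correct"]])
print("accuracy:", scored["correct"].mean())
```

In the notebook this might be called as `score_results(results_df, hf_dataset["answer"])`, which relies on `results_df` having one row per dataset row in the same order.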


## Save and Download Results
Once the experiment is complete, run the final cell to save the results to a CSV file, which you can download to your local machine and then upload to Hugging Face.

In [None]:
# @title 6. Save Results to CSV
filename = f"{inference_config['model_name'].replace('/', '_')}_{reasoning_approach.replace(' ', '')}_{time.strftime('%Y%m%d-%H%M%S')}.csv"
results_df.to_csv(filename, index=False)
print(f"  Results saved to '{filename}'.")
try:
    # google.colab is importable only inside a Colab runtime.
    from google.colab import files
    files.download(filename)
    print(f"\n  ⬇Downloading '{filename}'...")
except ImportError:
    print("\nCould not automatically trigger download. Outside Colab, the file is in the current working directory; in Colab, use the file explorer on the left.")

  Results saved to 'command-r_Tree-of-Thought(ToT)_20250816-114705.csv'.




  ⬇Downloading 'command-r_Tree-of-Thought(ToT)_20250816-114705.csv'...
