<a href="https://colab.research.google.com/github/toe-mas-koe-vik/AI_Finance/blob/main/Copy_of_2025_AIFF_Reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Test Time Scaling Inference using CoT and PoT on Llama Model**

### This Notebook was presented at the [Second Artificial Intelligence and The Future of Finance Conference](https://www.scheller.gatech.edu/events/ai-future-of-finance-conference/index.html)
 - #### **Author: Agam Shah**, **Edited by: Michael Galarnyk**
 - March 20, 2025


## Introduction

Training a state-of-the-art large language model (LLM) from scratch can cost [millions of dollars](https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=5b4457a4af7a). That cost is a major reason to look for performance gains at inference time instead, which is often something Google Colab or your local machine can handle.

This notebook covers the following:
* Basic package installation
* Model import from Hugging Face
* Chain-of-Thought (CoT) with LLM
* Program-of-Thought (PoT) with LLM

## Install and Import Libraries

In [None]:
!pip install transformers==4.48.3
!pip install torch==2.6.0
!pip install accelerate==1.3.0



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Importing Llama-3.2-3B-Instruct from Hugging Face

This part walks through how to download and load the **[Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)** model from Hugging Face using `transformers`.

Llama-3.2-3B-Instruct is a **3B parameter instruction-tuned model**, designed for conversational AI and general-purpose NLP tasks. It's optimized for **efficient inference**, making it ideal for use cases that require fast responses with low cost.


## Model Download and Setup

In [None]:
# select model from HF
model_name = "unsloth/Llama-3.2-3B-Instruct"

# Setup tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model as object
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_8bit=True, # Load the model with 8-bit version to save memory
    device_map="auto",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Notes:
- Passing `device_map="auto"` **automatically places the model** on the available hardware (GPU if present, otherwise CPU).
- When the model does not fit on a single device, `device_map="auto"` distributes its layers across the available devices.
- For **lower-end GPUs**, enabling **8-bit or 4-bit** quantization significantly reduces memory usage, but can reduce output quality.
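
To make the memory trade-off in the last note concrete, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes the "3B" in the model name means roughly 3 billion parameters and counts weights only (no activations or KV cache); `weight_memory_gb` is a hypothetical helper, not part of `transformers`.

```python
# Rough weight-memory estimate. Assumption: about 3e9 parameters, weights only.
PARAMS = 3e9

# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {
    "float32": 32,
    "float16 / bfloat16": 16,
    "8-bit": 8,
    "4-bit": 4,
}


def weight_memory_gb(n_params, bits):
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

Real usage will be higher than these figures, so treat them as a lower bound when deciding whether a model fits on a given GPU.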

## Non-CoT (Non-Chain of Thought) Prompting

Non-CoT prompting refers to **directly asking the model a question without explicitly guiding it through intermediate reasoning steps**.

In this example, we ask the model to **count the number of occurrences of the letter "r" in the word "raspberry."** The model is expected to generate an immediate answer without additional reasoning steps.

In [None]:
# Define the prompt without step-by-step reasoning
prompt = "Count number of r in raspberry?"

# Tokenize the prompt and move it to the model's device (GPU if available, otherwise CPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response without sampling (deterministic output)
with torch.no_grad():  # Disables gradient calculations to speed up inference
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # Limit the response length
        do_sample=False  # Ensures a deterministic output
    )

# Decode the model's output to a readable string
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated response
print(generated_text)

## CoT (Chain of Thought) Prompt Example  

Chain of Thought (CoT) prompting encourages the model to **explicitly reason through the problem step by step** before arriving at an answer. This approach is particularly useful for:
- **Mathematical reasoning**  
- **Logical problems**  
- **Multi-step tasks** like counting characters in a word  

By instructing the model to **carefully analyze each character in "raspberry" one by one**, we can improve its accuracy compared to the **Non-CoT** method.

### How CoT Enhances the Response  

1. **Explicit Reasoning:**  
   - The model is instructed to process **each character individually** instead of attempting to answer immediately.  
   - This minimizes errors in counting tasks.  

2. **Increased Token Limit (`max_new_tokens=500`)**  
   - Allows the model to fully elaborate on its reasoning before reaching a final answer.  

In [None]:
# Define the prompt with explicit step-by-step reasoning instructions
prompt = "Count number of r in raspberry? " \
         "Carefully reason step-by-step, character by character, seperating each character."

# Tokenize the prompt and move it to the appropriate device (GPU if available)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with a longer token limit to allow detailed reasoning
with torch.no_grad():  # Disables gradient calculations for efficiency
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,  # Allow enough space for step-by-step reasoning
        do_sample=False  # Ensures deterministic output
    )

# Decode the model's output into readable text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated response
print(generated_text)

Count number of r in raspberry? Carefully reason step-by-step, character by character, seperating each character. 1. R
2. A
3. S
4. P
5. B
6. E
7. R
8. R
9. Y
10. (space)
11. (space)
12. (space)
13. Y

Count the number of R's in the word "raspberry". There are 3 R's. The answer is 3.
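
The final answer above is correct, but the trace itself lists 13 entries (including three "(space)" items and a repeated Y) for a 9-character word, so it is worth checking independently. The sketch below walks the word in plain Python and compares the result with the answer parsed from the model's last line. The regular expression assumes the output ends with the phrase "The answer is N", which may not hold for other prompts or models.

```python
import re

word = "raspberry"

# Walk the word one character at a time, as the CoT prompt asks the model to do.
# Positions are 1-indexed to match the numbering style of the model's trace.
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]
print(positions, len(positions))

# Last line of the model output shown above, copied verbatim.
final_line = (
    'Count the number of R\'s in the word "raspberry". '
    "There are 3 R's. The answer is 3."
)

# Assumption: the final answer follows the phrase "The answer is".
match = re.search(r"The answer is (\d+)", final_line)
model_answer = int(match.group(1))

print(model_answer == word.count("r"))
```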


### Comparing Non-CoT vs. CoT  

| Prompt Type | Expected Behavior | Common Issues |
|-------------|------------------|--------------|
| **Non-CoT** | Immediate answer (e.g., "2") | May miscount or hallucinate |
| **CoT** | Step-by-step reasoning leading to "3" | Longer response time, more tokens used |

## Non-PoT (Non-Program of Thought) Prompting  

In **Non-PoT prompting**, we ask the model to **compute a numerical answer directly**, relying on its **pre-trained knowledge** rather than using explicit code execution.  

In this example, we ask:  

**"How many days are there between February 5, 2023, and March 25, 2024?"**  

This requires:  
✅ **Understanding date calculations**  
✅ **Accounting for leap years (2024 is a leap year)**  
✅ **Performing precise subtraction of days across months and years**  

However, without the ability to **execute code** for verification, LLMs may **hallucinate incorrect numbers** or struggle with date-based arithmetic.


In [None]:
# Define the prompt that asks for a direct numerical computation
prompt = "How many days are there between February 5, 2023, and March 25, 2024?"

# Tokenize the prompt and move it to the appropriate device (GPU if available)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with a longer token limit for a detailed answer
with torch.no_grad():  # Disables gradient calculations to improve efficiency
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,  # Allow space for reasoning
        do_sample=False  # Ensures deterministic output
    )

# Decode the model's output into readable text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated response
print(generated_text)

How many days are there between February 5, 2023, and March 25, 2024??
## Step 1: Determine the number of days in February 2023
February 2023 has 28 days.

## Step 2: Determine the number of days in March 2023
March 2023 has 31 days.

## Step 3: Determine the number of days from February 5, 2023, to the end of February 2023
Since February 5, 2023, is the 5th day of the month, there are 28 - 5 = 23 days remaining in February 2023.

## Step 4: Determine the number of days in March 2024
March 2024 has 31 days.

## Step 5: Determine the number of days from March 1, 2024, to March 25, 2024
Since March 25, 2024, is the 25th day of the month, there are 25 - 1 = 24 days in March 2024.

## Step 6: Calculate the total number of days between February 5, 2023, and March 25, 2024
Total days = days in February 2023 + days in March 2023 + days in March 2024 = 23 + 31 + 24 = 78.

The final answer is: $\boxed{78}$
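
The trace above adds up only the rest of February 2023, March 2023, and part of March 2024, skipping every month in between. As a sketch of what the full month-by-month sum looks like, the code below performs the same kind of stepwise arithmetic with the standard library and cross-checks it against direct date subtraction. The variable names are illustrative.

```python
import calendar
from datetime import date

start = date(2023, 2, 5)
end = date(2024, 3, 25)

# Days left in the starting month after the start day (28 - 5 for February 2023).
total = calendar.monthrange(start.year, start.month)[1] - start.day

# Add every full month strictly between the start month and the end month.
year, month = start.year, start.month
while True:
    month += 1
    if month > 12:
        year, month = year + 1, 1
    if (year, month) == (end.year, end.month):
        break
    total += calendar.monthrange(year, month)[1]  # handles the leap-year February

# Days elapsed in the final month up to the end date.
total += end.day

# Cross-check against direct date subtraction.
assert total == (end - start).days
print(total)
```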


## PoT (Program of Thought) Prompting  

Program of Thought (PoT) prompting can **enhance numerical accuracy** by instructing the model to **generate Python code** to compute results instead of relying on memorized approximations.  

### Why PoT is Better for Finance & Numerical Tasks  

**To mitigate numerical errors**, the model only **writes the Python code**; the arithmetic is then done by running that code in a Python interpreter, so the result no longer depends on the model's memory-based estimate.

In this example, we ask the model:  

**"How many days are there between February 5, 2023, and March 25, 2024? Give a simple Python code to calculate it."**  

**Instead of guessing a number, the model generates Python code that can be executed for improved accuracy.**

In [None]:
prompt = "How many days are there between February 5, 2023, and March 25, 2024? " \
         "Give a simple Python code to calculate it. "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)

How many days are there between February 5, 2023, and March 25, 2024? Give a simple Python code to calculate it.  The number of days between two dates can be calculated by subtracting the day of the first date from the day of the second date and adding the difference in months and days between the two dates.

Here is a simple Python code snippet that calculates the number of days between two dates:

```python
from datetime import datetime

# Define the two dates
date1 = datetime(2023, 2, 5)
date2 = datetime(2024, 3, 25)

# Calculate the difference in days
days = (date2 - date1).days

print(f"The number of days between February 5, 2023, and March 25, 2024, is {days}.")
```

However, the above code does not take into account the difference in months and days between the two dates. To calculate the number of days between two dates, we need to subtract the day of the first date from the day of the second date and add the difference in


### Executing the Model-Generated Code  

Now, let's **execute** the Python code generated by the model to obtain the exact number of days.

In [None]:
from datetime import datetime

# Define the two dates
date1 = datetime(2023, 2, 5)
date2 = datetime(2024, 3, 25)

# Calculate the difference in days
days = (date2 - date1).days

print(f"The number of days between February 5, 2023, and March 25, 2024, is {days}.")


The number of days between February 5, 2023, and March 25, 2024, is 414.
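
The cell above copies the generated code by hand. A minimal sketch of automating that step is shown below: it pulls the first fenced Python block out of the model's text and runs it. `extract_python_block` and `sample_output` are hypothetical names, the sample text is a shortened stand-in for `generated_text`, and `exec` should only be used on generated code that you have reviewed.

```python
import re

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block

# Stand-in for `generated_text`, shortened from the model output shown earlier.
sample_output = (
    "Here is a simple Python code snippet:\n\n"
    f"{FENCE}python\n"
    "from datetime import datetime\n"
    "date1 = datetime(2023, 2, 5)\n"
    "date2 = datetime(2024, 3, 25)\n"
    "days = (date2 - date1).days\n"
    f"{FENCE}\n\n"
    "However, the above code does not take into account"
)


def extract_python_block(text):
    """Return the first fenced Python block in text, or None if there is none."""
    pattern = re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


code = extract_python_block(sample_output)

# Run the extracted code in an isolated namespace and read back the result.
namespace = {}
exec(code, namespace)  # caution: review generated code before executing it
print(namespace["days"])
```

Running the code in its own dictionary keeps the generated variables out of the notebook's global scope, which makes it easier to inspect exactly what the model's program produced.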


### Why PoT is More Reliable  

| Prompt Type | Approach | Accuracy | Hallucination Risk |
|-------------|---------|----------|--------------------|
| **Non-PoT** | Memory-based estimation | ❌ May be incorrect | 🔴 High |
| **PoT** | Code execution | ✅ Exact when the generated code is correct | 🟢 Lower |

**PoT improves correctness** in finance, risk modeling, and date calculations, provided the generated code is itself correct.  
**It reduces arithmetic hallucination** by deferring computation to a trusted source (Python).  
**It enhances interpretability** since we can review and verify the generated code.  
Review still matters: in the example above, the model's own commentary wrongly claimed that its correct code was incomplete.

# **Key Takeaways**
## Two ways to improve reasoning
 - ### Learning time scaling
   - RL helps achieve this
   - Examples: OpenAI o1 and L1
   - Advantage: No post-hoc effort required
 - ### Inference time scaling
   - User helps achieve this
   - Examples: CoT, PoT, etc.
   - Advantage: Human in the loop

