# Welcome to the 'Reasoning with LLMs' Workshop!

## Introduction

In this workshop we will explore reasoning with LLMs. 
[More detailed intro is TBD]

First, let's do the necessary imports:

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


## LLM Reasoning Fundamentals & Probabilistic Nature

LLMs are fundamentally probabilistic models that predict the next token based on the preceding sequence (autoregressive generation).

They learn complex conditional probability distributions over sequences of tokens from vast text corpora.
This probabilistic nature is the source of their generative capabilities and creativity, but also their unpredictability and potential for hallucination.
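
To make the autoregressive idea concrete, here is a minimal sketch that uses a made-up lookup table in place of a real model. `TOY_MODEL` and both helper functions are hypothetical illustrations; a real LLM conditions on the whole preceding sequence rather than on the last token only.

```python
import math

# Hypothetical toy "model": P(next token | previous token) as a lookup table.
# A real LLM conditions on the entire preceding sequence, not just one token.
TOY_MODEL = {
    "<bos>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.9, "<eos>": 0.1},
    "dog": {"sleeps": 0.2, "<eos>": 0.8},
    "sleeps": {"<eos>": 1.0},
}

def greedy_generate(model, start="<bos>", max_steps=10):
    """Repeatedly append the most probable next token until <eos> is produced."""
    tokens = [start]
    for _ in range(max_steps):
        dist = model[tokens[-1]]
        next_token = max(dist, key=dist.get)
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens

def sequence_log_prob(model, tokens):
    """Chain rule: log P(sequence) is the sum of the per-step conditional log-probs."""
    return sum(math.log(model[prev][cur]) for prev, cur in zip(tokens, tokens[1:]))

generated = greedy_generate(TOY_MODEL)
sequence_prob = math.exp(sequence_log_prob(TOY_MODEL, generated))
print(generated)
print(f"{sequence_prob:.3f}")
```

The probability of the whole sequence is the product of the conditional probabilities along the path, which is why even a likely-looking continuation can have a small overall probability.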

During generation, at each step, the model outputs logits (unnormalized log-probabilities) for every token in its vocabulary. These logits are converted to probabilities (typically via softmax).
Greedy decoding picks the single most probable token at each step, whereas sampling methods (controlled by temperature, top-k, and top-p) draw from the probability distribution, which allows for diversity.
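
The effect of temperature, top-k, and top-p can be sketched on a handful of made-up logits in plain Python. This is a simplified illustration, not the library's implementation; the vocabulary, the logit values, and the helper names `softmax` and `filter_top_k_top_p` are assumptions chosen for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top-k tokens, then the smallest prefix reaching mass top_p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        mass = sum(probs[i] for i in order)  # renormalize over the top-k survivors
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return [probs[i] / total if i in order else 0.0 for i in range(len(probs))]

vocab = [" man", " great", " king", " woman", " mighty"]  # toy vocabulary
logits = [2.0, 1.5, 1.0, 0.9, 0.7]                         # made-up logits

probs = softmax(logits, temperature=0.7)
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.95)

# Greedy decoding: always the argmax. Sampling: a draw from the filtered distribution.
greedy_token = vocab[max(range(len(probs)), key=lambda i: probs[i])]
sampled_token = random.Random(0).choices(vocab, weights=filtered, k=1)[0]
print(greedy_token, repr(sampled_token))
print([round(p, 3) for p in filtered])
```

Tokens outside the filtered set get probability zero, so sampling can only pick among the survivors, while greedy decoding ignores the rest of the distribution entirely.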
Let's try these out with a simple (and small) GPT-2 model.

In [2]:

model_id = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

prompt = "Once upon a time in a land far, far away, there lived a"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    output_scores=True,
    return_dict_in_generate=True,
    num_return_sequences=1
)



In [3]:
outputs[2]

((tensor([[[[-1.2210e+00,  1.5021e+00,  6.1730e-01,  ..., -1.1935e+00,
             -8.7997e-01,  1.1420e+00],
            [-1.8341e+00,  2.1130e+00,  2.3510e+00,  ..., -4.8222e-01,
             -1.0945e+00,  1.5408e+00],
            [-2.3145e+00,  2.7101e+00,  1.5073e+00,  ..., -5.7809e-01,
             -1.9292e+00,  2.2634e+00],
            ...,
            [-1.9849e+00,  2.3091e+00,  2.4815e+00,  ..., -2.4145e-01,
             -2.0052e+00,  1.2417e+00],
            [-2.0536e+00,  1.4504e+00,  2.9853e+00,  ..., -7.0710e-01,
             -2.4352e+00,  1.9423e+00],
            [-1.4235e+00,  1.5343e+00,  3.2570e+00,  ..., -2.3549e+00,
             -2.5238e+00,  1.6361e+00]],
  
           [[-3.8355e-01,  8.8163e-01, -3.5473e-01,  ...,  2.1880e-01,
              1.7623e+00, -2.8943e-01],
            [-1.3115e+00, -7.4689e-01, -8.2298e-01,  ...,  2.3210e+00,
              3.2909e+00, -1.0974e-01],
            [-7.1399e-01, -1.6105e+00, -2.9748e+00,  ..., -1.6920e+00,
              4.4462

In [4]:
# outputs.scores is a tuple of tensors, one per generation step
# Each tensor shape: (batch_size, vocab_size)
print(f"Number of generation steps: {len(outputs.scores)}")
output_scores = outputs.scores  # Processed scores for each generation step (after temperature/top-k/top-p)

# Convert the scores of the first generated token to probabilities
first_step_probs = torch.softmax(output_scores[0], dim=-1)
top_k_probs, top_k_indices = torch.topk(first_step_probs, k=5)

print("\nTop 5 tokens and probabilities for the first generated token:")
for i in range(5):
    token = tokenizer.decode(top_k_indices[0, i])
    prob = top_k_probs[0, i].item()
    print(f"- Token: '{token}', Probability: {prob:.4f}")

Number of generation steps: 35

Top 5 tokens and probabilities for the first generated token:
- Token: ' man', Probability: 0.6906
- Token: ' great', Probability: 0.0463
- Token: ' king', Probability: 0.0276
- Token: ' woman', Probability: 0.0266
- Token: ' mighty', Probability: 0.0209


In [5]:
# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(generated_text)

Once upon a time in a land far, far away, there lived a man who, as a child, used to wear a long, white, and gold-colored hat. It was a hat that he wore with his hands. He said,


Now, let's try a more capable model that can follow instructions.

In [7]:
model_id = "google/gemma-3-1b-it" #"Qwen/Qwen2-0.5B-Instruct"  
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
# Standard Prompt
prompt = "Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left? Answer with just a number in the following format <answer>Number</answer> and then stop. Example: <answer>5</answer>"

# Prepare the input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text
outputs = model.generate(
    **inputs,
    max_new_tokens = 350,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    output_scores=True,
    return_dict_in_generate=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id # Ensure the model stops at the end of the sequence
)
# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print("\nStandard Prompt Generated Text:")
print(generated_text)




Standard Prompt Generated Text:
Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left? Answer with just a number in the following format <answer>Number</answer> and then stop. Example: <answer>5</answer>

Let $M$ be the number of pastries Natalia sold in the morning, and $A$ be the number of pastries Natalia sold in the afternoon.
We are given that $M = 48$ and $A = 23$. The total number of pastries sold is $M + A = 48 + 23 = 71$.
We are also given that Natalia baked a total of 100 pastries. Let $L$ be the number of pastries Natalia had left.
Then, $M + A + L = 100$.
We know that $M = 48$ and $A = 23$, so we have $48 + 23 + L = 100$.
This simplifies to $71 + L = 100$.
Subtracting 71 from both sides, we get $L = 100 - 71 = 29$.
Thus, Natalia had 29 pastries left.

Final Answer: The final answer is $\boxed{29}$
The problem states that Natalia baked a total of 100 pastries. She sold 48 pastries in 

In [87]:
# --- Few-Shot Prompt ---
fs_prompt = f"""
You will be given a math problem. Generate solution and then stop. Here is the example:
Q: A train traveled 350 miles in the first 5 hours and then 400 miles in the next 5 hours. What was the average speed of the train for the entire journey?
A: 
Length of the journey = 350 + 400 = 750 miles
<answer>75</answer>{tokenizer.eos_token}

Q: Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?
A:""" # Let the model complete this

# Prepare the input
inputs = tokenizer(fs_prompt, return_tensors="pt").to(model.device)
# Generate text
outputs = model.generate(
    **inputs,
    max_new_tokens = 350,
    do_sample=True,
    #top_k=50,
    #top_p=0.95,
    temperature=0.7,
    output_scores=True,
    return_dict_in_generate=True,
    #num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id, # Ensure the model stops at the end of the sequence
    
)
# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print("\nFew-Shot Prompt Generated Text:")
print(generated_text)



Few-Shot Prompt Generated Text:

You will be given a math problem. Generate solution and then stop. Here is the example:
Q: A train traveled 350 miles in the first 5 hours and then 400 miles in the next 5 hours. What was the average speed of the train for the entire journey?
A: 
Length of the journey = 350 + 400 = 750 miles
<answer>75</answer>

Q: Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?
A: 
Length of the journey = 48 + 23 = 71
<answer>71</answer>









In [34]:
# --- Few-Shot CoT Prompt ---
cot_prompt = f"""
You will be given a math problem. Generate solution step by step and then stop. Here is the example:
Q: A train traveled 350 miles in the first 5 hours and then 400 miles in the next 5 hours. What was the average speed of the train for the entire journey?
A: <think>
First, find the total distance traveled.
Total distance = distance in first part + distance in second part
Total distance = 350 + 400 = 750 miles
Next, find the total time taken for the journey.
Total time = time in first part + time in second part
Total time = 5 + 5 = 10 hours
Finally, calculate the average speed by dividing the total distance by the total time.
Average speed = Total distance / Total time
Average speed 1  = 750 / 10 = 75
</think>
<answer>75</answer>{tokenizer.eos_token}

Q: Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?
A:""" # Let the model complete this

inputs = tokenizer(cot_prompt, return_tensors="pt").to(model.device)


In [48]:

# --- Generate with CoT and get scores ---
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    output_scores=True,
    return_dict_in_generate=True,
    temperature=0.7, # Allow some variation
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
    
)



In [49]:
# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(generated_text)


You will be given a math problem. Generate solution step by step and then stop. Here is the example:
Q: A train traveled 350 miles in the first 5 hours and then 400 miles in the next 5 hours. What was the average speed of the train for the entire journey?
A: <think>
First, find the total distance traveled.
Total distance = distance in first part + distance in second part
Total distance = 350 + 400 = 750 miles
Next, find the total time taken for the journey.
Total time = time in first part + time in second part
Total time = 5 + 5 = 10 hours
Finally, calculate the average speed by dividing the total distance by the total time.
Average speed = Total distance / Total time
Average speed 1  = 750 / 10 = 75
</think>
<answer>75</answer>

Q: Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?
A: <think>
First, find the total number of pastries sold.
Total pastries sold = 48 + 23 = 71
Next, find the nu

We see a problem with stopping: the model keeps generating text after the answer, and often it continues by inventing a new question.
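
One simple way to cope with a model that does not stop is to post-process the output and cut it at the first closing answer tag. The sketch below assumes it is given only the newly generated text (without the prompt, which contains `</answer>` itself); the function name `truncate_after_stop` and the sample completion are made up for illustration.

```python
def truncate_after_stop(completion, stop="</answer>"):
    """Keep text up to and including the first stop string; return it unchanged if absent."""
    index = completion.find(stop)
    if index == -1:
        return completion
    return completion[: index + len(stop)]

# Hypothetical completion: newly generated text only, without the prompt
completion = (
    " <think>\nPastries left = 100 - 71 = 29\n</think>\n"
    "<answer>29</answer>\n\nQ: A new, made-up question...\nA: <answer>7</answer>"
)
truncated = truncate_after_stop(completion)
print(truncated)
```

Recent versions of `transformers` also appear to support stopping on strings during generation (a `stop_strings` argument to `generate`); check the documentation of your installed version before relying on it.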

In [50]:
result_str = """You will be given a math problem. Generate solution step by step and then stop. Here is the example:
Q: A train traveled 350 miles in the first 5 hours and then 400 miles in the next 5 hours. What was the average speed of the train for the entire journey?
A: <think>
First, find the total distance traveled.
Total distance = distance in first part + distance in second part
Total distance = 350 + 400 = 750 miles
Next, find the total time taken for the journey.
Total time = time in first part + time in second part
Total time = 5 + 5 = 10 hours
Finally, calculate the average speed by dividing the total distance by the total time.
Average speed = Total distance / Total time
Average speed 1  = 750 / 10 = 75
</think>
<answer>75</answer>

Q: Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?
A: <think>
First, find the total number of pastries sold.
Total pastries sold = 48 + 23 = 71
Next, find the number of pastries she had left.
Pastries left = Total pastries - Total pastries sold
Pastries left = 100 - 71 = 29
</think>
<answer>29</answer>
<answer>29</answer>"""

In [53]:
import re
pattern = r"<answer>(.*?)</answer>"
matches = re.findall(pattern, generated_text, re.DOTALL)
print("\nExtracted Answers:")
for match in matches:
    print(match.strip())

print("\nLast Answer:" + matches[-1].strip())


Extracted Answers:
75
29
29

Last Answer:29


In [54]:
def extract_answer(text):
    # Define the regex pattern to match the answer
    pattern = r"<answer>(.*?)</answer>"
    
    # Use re.findall to extract all matches
    matches = re.findall(pattern, text, re.DOTALL)
    
    # Return the last match if available
    if matches:
        return matches[-1].strip()
    else:
        return None
# Example usage
text = result_str
answer = extract_answer(text)
print("\nExtracted Answer:", answer)


Extracted Answer: 29


In [88]:
def evaluate_model_accuracy(model, tokenizer, prompt, cot_prompt, ground_truth, no_of_generations):
    """
    Evaluates model accuracy by comparing generated answers with ground truth.
    
    Args:
        model: The language model
        tokenizer: The tokenizer
        prompt: Standard prompt for the problem
        cot_prompt: Chain of thought prompt for the problem
        ground_truth: The correct answer
        no_of_generations: Number of generations to run
        
    Returns:
        dict: Dictionary containing accuracy metrics and generated answers
    """
    standard_correct = 0
    cot_correct = 0
    standard_answers = []
    cot_answers = []
    

    for i in range(no_of_generations):
        # Standard prompt generation
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens = 250,
            do_sample=True,
            return_dict_in_generate=True,
            temperature=0.7,
            output_scores=True,                        
            eos_token_id=tokenizer.eos_token_id # Ensure the model stops at the end of the sequence
        )
        standard_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
        try:
            # Extract just the number from the response
            standard_answer = extract_answer(standard_text)
            # Convert to integer
            standard_answer = int(standard_answer)
            standard_answers.append(standard_answer)            
            if standard_answer == ground_truth:
                standard_correct += 1
        except (TypeError, ValueError):  # no <answer> tag found, or its content is not an integer
            standard_answers.append(None)
            
        # CoT prompt generation
        inputs = tokenizer(cot_prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            output_scores=True,
            return_dict_in_generate=True,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            
        )
        cot_text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
        try:
            # Extract number between <answer> tags
            answer_text = extract_answer(cot_text)
            # Convert to integer
            cot_answer = int(answer_text)            
            cot_answers.append(cot_answer)
            if cot_answer == ground_truth:
                cot_correct += 1
        except (TypeError, ValueError):  # no <answer> tag found, or its content is not an integer
            cot_answers.append(None)
    
    results = {
        "standard_accuracy": (standard_correct / no_of_generations) * 100,
        "cot_accuracy": (cot_correct / no_of_generations) * 100,
        "standard_answers": standard_answers,
        "cot_answers": cot_answers,
        "ground_truth": ground_truth,
        "total_generations": no_of_generations
    }
    
    return results

# Example usage:
#prompt = "Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left? Answer with just a number."
ground_truth = 29  # 100 - (48 + 23) = 29

results = evaluate_model_accuracy(
    model=model,
    tokenizer=tokenizer,
    prompt=fs_prompt,
    cot_prompt=cot_prompt,
    ground_truth=ground_truth,
    no_of_generations=10
)

# Print results
print(f"\nResults after {results['total_generations']} generations:")
print(f"Standard Prompt Accuracy: {results['standard_accuracy']:.1f}%")
print(f"Chain of Thought Accuracy: {results['cot_accuracy']:.1f}%")
print("\nStandard Prompt Answers:", results['standard_answers'])
print("CoT Prompt Answers:", results['cot_answers'])
print(f"Ground Truth: {results['ground_truth']}")


Results after 10 generations:
Standard Prompt Accuracy: 50.0%
Chain of Thought Accuracy: 50.0%

Standard Prompt Answers: [40, 29, 71, 71, 71, 71, 29, 29, 29, 29]
CoT Prompt Answers: [71, 29, 0, 29, 29, 75, 29, 29, 0, 42]
Ground Truth: 29
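
The value 75 among the CoT answers equals the answer of the worked example inside the prompt. This is consistent with the decoded text containing the prompt, so that `extract_answer` can fall back to the prompt's own `<answer>` tag when the model produces none. A minimal sketch of one way to guard against this, assuming the decoded text still contains the prompt's tags unchanged, is to skip as many matches as the prompt contains. The helper `extract_new_answer` and the shortened prompt are hypothetical.

```python
import re

ANSWER_PATTERN = r"<answer>(.*?)</answer>"

def extract_new_answer(full_text, prompt):
    """Return the first <answer> that appears after the tags already present in the prompt."""
    prompt_tags = len(re.findall(ANSWER_PATTERN, prompt, re.DOTALL))
    matches = re.findall(ANSWER_PATTERN, full_text, re.DOTALL)
    if len(matches) > prompt_tags:
        return matches[prompt_tags].strip()
    return None  # the model produced no answer tag of its own

# Shortened, made-up prompt with one example answer
prompt = "Q: example question\nA: <answer>75</answer>\n\nQ: real question\nA:"
print(extract_new_answer(prompt + " <think>...</think>\n<answer>29</answer>", prompt))  # 29
print(extract_new_answer(prompt + " <think>unfinished reasoning", prompt))              # None
```

Taking the first new match rather than the last also ignores answers to any extra questions the model invents after the real one.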


In [93]:
from tqdm.notebook import tqdm

def evaluate_model_accuracy_batched(model, tokenizer, prompt, cot_prompt, ground_truth, no_of_generations, batch_size=4):
    """
    Evaluates model accuracy using batched processing for faster inference.
    """
    standard_correct = 0
    cot_correct = 0
    standard_answers = []
    cot_answers = []
    
    # Create batches of prompts
    standard_prompts = [prompt] * batch_size
    cot_prompts = [cot_prompt] * batch_size
    
    for i in tqdm(range(0, no_of_generations, batch_size), desc="Processing Batches"):
        current_batch_size = min(batch_size, no_of_generations - i)
        
        # Standard prompt batch generation
        inputs = tokenizer(standard_prompts[:current_batch_size], 
                         return_tensors="pt", 
                         padding=True).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            do_sample=True,
            return_dict_in_generate=True,
            temperature=0.7,
            output_scores=True,
            pad_token_id=tokenizer.eos_token_id
        )
        
        # Process standard batch results
        for j in range(current_batch_size):
            try:
                text = tokenizer.decode(outputs[0][j], skip_special_tokens=True)
                answer = int(extract_answer(text))
                standard_answers.append(answer)
                if answer == ground_truth:
                    standard_correct += 1
            except (TypeError, ValueError):  # no <answer> tag found, or its content is not an integer
                standard_answers.append(None)
        
        # CoT prompt batch generation
        inputs = tokenizer(cot_prompts[:current_batch_size], 
                         return_tensors="pt", 
                         padding=True).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            do_sample=True,
            return_dict_in_generate=True,
            temperature=0.7,
            output_scores=True,
            pad_token_id=tokenizer.eos_token_id
        )
        
        # Process CoT batch results
        for j in range(current_batch_size):
            try:
                text = tokenizer.decode(outputs[0][j], skip_special_tokens=True)
                answer = int(extract_answer(text))
                cot_answers.append(answer)
                if answer == ground_truth:
                    cot_correct += 1
            except (TypeError, ValueError):  # no <answer> tag found, or its content is not an integer
                cot_answers.append(None)

    results = {
        "standard_accuracy": (standard_correct / no_of_generations) * 100,
        "cot_accuracy": (cot_correct / no_of_generations) * 100,
        "standard_answers": standard_answers,
        "cot_answers": cot_answers,
        "ground_truth": ground_truth,
        "total_generations": no_of_generations
    }
    
    return results

In [96]:
results = evaluate_model_accuracy_batched(
    model=model,
    tokenizer=tokenizer,
    prompt=fs_prompt,
    cot_prompt=cot_prompt,
    ground_truth=ground_truth,
    no_of_generations=32,
    batch_size=16  # Adjust based on your GPU memory
)


In [97]:
# Print results
print(f"\nResults after {results['total_generations']} generations:")
print(f"Standard Prompt Accuracy: {results['standard_accuracy']:.1f}%")
print(f"Chain of Thought Accuracy: {results['cot_accuracy']:.1f}%")
print("\nStandard Prompt Answers:", results['standard_answers'])
print("CoT Prompt Answers:", results['cot_answers'])
print(f"Ground Truth: {results['ground_truth']}")


Results after 32 generations:
Standard Prompt Accuracy: 28.1%
Chain of Thought Accuracy: 56.2%

Standard Prompt Answers: [71, 71, 71, 60, 29, 71, 29, 71, None, 300, 71, 40, 29, 71, 71, 29, 14, None, 29, 71, 71, 100, 29, 0, None, 29, 71, 29, 29, 100, 71, 71]
CoT Prompt Answers: [29, 0, 23, 29, 29, 0, 0, 29, 29, 29, 0, 29, 29, 29, 0, 29, 23, 0, 29, 0, 0, 29, 0, 29, 29, 29, 29, 29, 29, 0, 0, 77]
Ground Truth: 29
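
A quick tally makes the distribution of sampled answers easier to read than the raw list. The sketch below copies the CoT answers from the 32-generation run above and counts them with the standard library; it only summarizes the numbers already shown.

```python
from collections import Counter

# CoT answers copied from the 32-generation run above
cot_answers = [29, 0, 23, 29, 29, 0, 0, 29, 29, 29, 0, 29, 29, 29, 0, 29,
               23, 0, 29, 0, 0, 29, 0, 29, 29, 29, 29, 29, 29, 0, 0, 77]

counts = Counter(cot_answers)
for answer, count in counts.most_common():
    print(f"{answer}: {count}/{len(cot_answers)}")
```

In this run the most frequent sampled answer coincides with the ground truth, even though individual samples are wrong a large share of the time.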
