<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Solve Challenging Problems using Advanced Prompt Engineering**


Estimated time needed: **60** minutes


Slow and steady wins the race! This is true for many things, including LLM outputs. Fast inference is desirable in situations such as real-time chatbots or live translation apps, but for complex problems we want our models to take their time and reason deliberately. For example, when conducting scientific research, analyzing financial data for investment decisions, or working through challenging mathematical proofs, accuracy matters far more than speed, and it's worth allocating extra compute time to get the answer right.

In this project, you'll learn about Test-Time Compute by experimenting with strategies that spend more compute at inference and observing how they improve results on mathematical reasoning problems.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-Required-Libraries">Installing Required Libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing Required Libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#What-is-Test-Time-Compute?">What is Test-Time Compute?</a>
    </li>
    <li>
        <a href="#Chain-of-Thought-(CoT)-Prompting">Chain-of-Thought (CoT) Prompting</a>
    </li>
    <li>
        <a href="#CoT-Experiment-Results">CoT Experiment Results</a>
    </li>
    <li><a href="#Best-of-N-Sampling">Best-of-N Sampling</a></li>
    <li><a href="#Best-of-N-Experiment-Results">Best-of-N Experiment Results</a></li>
    <li><a href="#Self-verification-Strategy">Self-verification Strategy</a></li>
    <li><a href="#Self-verification-Experiment-Results">Self-verification Experiment Results</a></li>
    <li>
         <a href="#Exercises">Exercises</a>
        <ol>
            <li><a href="#Exercise-1---Implementing-Chain-of-Thought-Prompting">Exercise 1 - Implementing Chain-of-Thought Prompting</a></li>
            <li><a href="#Exercise-2---Best-of-N-Sampling-with-Custom-Selection">Exercise 2 - Best-of-N Sampling with Custom Selection</a></li>
        </ol>
    </li>
</ol>


## Objectives

After completing this lab, you will be able to:

* Explain the concept of **Test-Time Compute** and its role in improving model performance during inference.
* Apply **Chain-of-Thought prompting** to enable step-by-step reasoning in large language models.
* Use **Best-of-N sampling** to generate multiple candidate outputs and select the most accurate one.
* Implement **Self-verification** strategies to allow models to check and refine their own answers.
* Analyze how increasing test-time compute affects solution accuracy, consistency, and efficiency.


----


## Setup


For this lab, we will be using the following libraries:

*   [`langchain`](https://www.langchain.com/) and its `langchain-openai` integration for running and managing inference with OpenAI models.
*   `re` (Python standard library) for extracting numeric answers from model responses.
*   `collections` (Python standard library) for counting answers with `Counter`.


### Installing Required Libraries

Installing the libraries may take up to 10 minutes.


In [1]:
%pip install langchain-openai==0.3.30  | tail -n 1
%pip install langchain==0.3.27 | tail -n 1

Successfully installed jiter-0.13.0 langchain-core-0.3.83 langchain-openai-0.3.30 langsmith-0.7.4 openai-1.109.1 orjson-3.11.7 regex-2026.1.15 requests-toolbelt-1.0.0 tiktoken-0.12.0 uuid-utils-0.14.0 xxhash-3.6.0
Note: you may need to restart the kernel to use updated packages.
Successfully installed langchain-0.3.27 langchain-text-splitters-0.3.11
Note: you may need to restart the kernel to use updated packages.


**Note**: Running the next cell restarts the kernel, which causes a pop-up notification. It's nothing to worry about! Just close it and keep progressing through the project.


In [2]:
from IPython import get_ipython 
get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

### Importing Required Libraries


In [None]:
import warnings
warnings.filterwarnings('ignore')

from langchain_openai import OpenAI
import re
from collections import Counter

## What is Test-Time Compute? 

**Test-Time Compute** refers to the use of additional computational resources (time, tokens, reasoning steps) during inference when an LLM is answering a query. The main idea is that giving models more "time to think" at inference can improve their performance on complex problems. 

The traditional approach to inference is: the model reads the prompt → generates the answer immediately (with an emphasis on inference speed). The **Test-Time Compute** approach is: the model reads the prompt → engages in extended reasoning → produces a better answer.

This can involve several different strategies:

* **Chain-of-Thought Prompting**: step-by-step problem solving
* **Best-of-N Sampling**: generating several solutions and selecting the best
* **Self-verification**: checking and correcting its own work
* **Search/tree exploration**: exploring multiple solution paths

In this lab, we'll experiment with the first three strategies to test how increased compute improves results.


---

## Chain-of-Thought (CoT) Prompting

Chain-of-Thought prompting is basically just asking an AI to "show its work", like a math teacher would ask you to do in school.

Instead of jumping straight to an answer, you prompt the AI to walk through its reasoning step-by-step. So instead of asking "what's 47 x 23?" and getting just "1081", you'd ask it to break down the problem, such that the response might look like "First, I'll multiply 47 by 20, which gives me 940. Then 47 by 3, which is 141. Adding those together: 940 + 141 = 1081."
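The arithmetic in that example is easy to check in Python. This is only an illustrative sketch: `multiply_in_steps` is a hypothetical helper (not part of the lab's code) that mirrors the tens-and-ones split described above and assumes non-negative integers.

```python
def multiply_in_steps(a: int, b: int) -> list[str]:
    """Write out a * b as two partial products plus a final addition."""
    tens, ones = (b // 10) * 10, b % 10   # e.g. 23 -> 20 and 3
    partial_tens = a * tens
    partial_ones = a * ones
    total = partial_tens + partial_ones
    return [
        f"{a} x {tens} = {partial_tens}",
        f"{a} x {ones} = {partial_ones}",
        f"{partial_tens} + {partial_ones} = {total}",
    ]

steps = multiply_in_steps(47, 23)
for step in steps:
    print(step)
```

Each printed line corresponds to one sentence of the "show your work" response, which is exactly the kind of intermediate result a CoT prompt asks the model to write down.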

The cool thing is that this actually makes AI models better at complex problems. When they articulate their reasoning, they're less likely to make mistakes, especially on tricky logic puzzles, math problems, or anything requiring multiple steps. 

How does this relate back to Test-Time Compute? Essentially, you can make an LLM smarter by training it on more data, or you can let it spend more time reasoning through a problem. Chain-of-Thought is one way to use that extra compute: the model generates more tokens as it thinks through the steps, which means more computation, which often leads to better answers.


In [2]:
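# `model=` selects the model; `name=` only labels the runnable and leaves the default model in place.
# Assumption: the recorded outputs came from the completion-style default, gpt-3.5-turbo-instruct.
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.3)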

problem = "A number is formed by reversing the digits of 27. Add this number to 27. What’s the sum?"

# Without CoT
print("WITHOUT Chain of Thought:")
response_direct = llm.invoke(f"Answer directly: {problem}")
print(response_direct)
print("\n" + "="*60 + "\n") # print formatting 

# With CoT
print("WITH Chain of Thought:")
cot_prompt = f"""Think step-by-step and show your reasoning:
{problem}"""
response_cot = llm.invoke(cot_prompt)
print(response_cot)

WITHOUT Chain of Thought:


The sum is 54.


WITH Chain of Thought:


Step 1: Reverse the digits of 27
27 reversed is 72.

Step 2: Add 27 and 72
27 + 72 = 99

Therefore, the sum of 27 and its reversed digits is 99.


### CoT Experiment Results

As you can see above, without the CoT prompt to think step-by-step and show the model's reasoning, the model immediately responds with the incorrect answer of 54, presumably because it just added 27 + 27. However, when the model takes the extra time to write out the reverse of the number 27, it correctly adds 27 + 72 which is 99. 
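To confirm which of the two responses is right, we can compute the expected answer directly. This is a small sanity check rather than part of the prompting technique, and `reverse_and_add` is a hypothetical helper name introduced here for illustration.

```python
def reverse_and_add(n: int) -> int:
    """Reverse the decimal digits of a non-negative integer and add the result to n."""
    reversed_n = int(str(n)[::-1])   # 27 -> "27" -> "72" -> 72
    return n + reversed_n

expected = reverse_and_add(27)
print(expected)
```

The result matches the Chain-of-Thought response (99), while the direct response (54) equals 27 + 27, which supports the guess that the reversal step was skipped.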

Here's what's really interesting about this example: it shows that the model isn't actually "thinking" differently in some hidden way - the act of writing out the intermediate steps literally changes the computation.

It's almost like the model can't "see" what it needs to do until it externalizes it. By forcing itself to type "the reverse of 27 is 72," it creates new information in its context that it can then use. Without that step written down, it's like the reversal never happened in the model's "mind."

This also hints at why test-time compute matters so much. When the model generates more tokens (the reverse step, the addition breakdown, etc.), it's essentially giving itself more opportunities to correct course. Each token it outputs becomes part of the context for the next token, creating a sort of scaffold for better reasoning.
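To make "more tokens" concrete, here is a rough comparison of the two recorded responses. It is only a sketch: counting whitespace-separated words is a crude stand-in for real tokenization, and `rough_length` is a hypothetical helper.

```python
# The two responses recorded in the experiment above.
direct_response = "The sum is 54."
cot_response = """Step 1: Reverse the digits of 27
27 reversed is 72.

Step 2: Add 27 and 72
27 + 72 = 99

Therefore, the sum of 27 and its reversed digits is 99."""

def rough_length(text: str) -> int:
    """Count whitespace-separated words as a crude proxy for tokens."""
    return len(text.split())

direct_len = rough_length(direct_response)
cot_len = rough_length(cot_response)
print(f"direct: {direct_len} words, CoT: {cot_len} words")
```

The CoT response is several times longer, and every extra word is generated by another forward pass of the model: that is the additional test-time compute.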


---

## Best-of-N Sampling 

With Best-of-N sampling, we generate multiple answers to the same question and pick the best one. 

Essentially, instead of making a single call to the LLM, you ask it the same question N times (for example, 5 or 10). Each time, the model might approach the problem slightly differently or make different choices in its reasoning. Some answers will be correct, and some might have mistakes.

Then, you select the best answer from all the candidates. You can do this by:
- Picking the most common answer (majority vote)
- Using human verification to check which answer is correct
- Having another LLM judge which response is highest quality

The underlying principle is that the model is more likely to get things right than wrong, but no single response is guaranteed to be correct. If you generate 10 answers and 8 of them say "42" while 2 say "41", then the answer is probably 42.

This is the Test-Time Compute principle in practice. You're using more computational resources during inference (running the model multiple times instead of once) to improve the quality of the final answer.
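A quick calculation shows why voting helps when the model is "more likely right than wrong". The sketch below rests on simplifying assumptions that are not from the lab: each sample is correct independently with the same probability `p`, and we require a strict majority of correct samples (in practice a plurality is enough when the wrong answers disagree with each other). `majority_correct_probability` is a hypothetical helper.

```python
from math import comb

def majority_correct_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct,
    when each sample is correct with probability p (binomial model)."""
    needed = n // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))

for n in (1, 3, 7, 15):
    print(n, round(majority_correct_probability(0.6, n), 3))
```

Under these assumptions the probability rises with N whenever `p` is above 0.5, and falls with N when `p` is below 0.5, so majority voting amplifies whatever tendency the model already has.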


In [3]:
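# `model=` selects the model (`name=` would only label the runnable); the model shown is an assumption
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.7)  # higher temp for variety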

# a problem where we want to sample multiple solutions
problem = """
A farmer has 10 cats. Half of them run away, and 3 return. Then half of the remaining ones have kittens, 2 each. How many cats are there now?
"""

print("=" * 60)
print("Best-of-N Sampling (N=7)")
print("=" * 60)
print(f"Problem: {problem}\n")

Best-of-N Sampling (N=7)
Problem: 
A farmer has 10 cats. Half of them run away, and 3 return. Then half of the remaining ones have kittens, 2 each. How many cats are there now?




### Generating Multiple Candidate Solutions

This step generates multiple independent solutions to the same problem by calling the language model N times (in this case, 7 times) with identical prompts. By producing several candidate answers, the system creates a diverse pool of potential solutions that may use different reasoning approaches or arrive at different conclusions, which can then be compared and evaluated to identify the most reliable answer through consensus or quality assessment.


In [4]:
# generate N candidates
N = 7
candidates = []

print("Generating candidates...\n")
for i in range(N):
    response = llm.invoke(f"Solve this step-by-step:\n{problem}")
    candidates.append(response)
    print(f"Candidate {i+1}:")
    print(response)
    print("-" * 60)

Generating candidates...

Candidate 1:

Step 1: Start with 10 cats.
Step 2: Half of them run away, so 10 divided by 2 is 5. There are now 5 cats remaining.
Step 3: 3 of the cats return, so add 3 to the current number of cats (5) to get 5+3=8 cats.
Step 4: Half of the remaining cats have kittens, so half of 8 is 4. This means that 4 cats have kittens, and each cat has 2 kittens. 4 x 2 = 8 kittens.
Step 5: Add the number of kittens (8) to the current number of cats (8) to get 8+8=16 cats in total.
Therefore, after all the changes, there are now 16 cats. 
------------------------------------------------------------
Candidate 2:

Step 1: Start with the given information.
The farmer has 10 cats.

Step 2: Half of the cats run away.
Half of 10 is 5.
There are now 10 - 5 = 5 cats left.

Step 3: 3 of the cats return.
There are now 5 + 3 = 8 cats.

Step 4: Half of the remaining cats have kittens.
Half of 8 is 4.
There are now 8 - 4 = 4 cats without kittens.

Step 5: Each of the remaining cats ha

### Selecting the Best Answer Through Voting

The code below extracts the final numerical answer from each candidate solution: a regular expression finds all whole numbers in the response text, and the last one is assumed to be the answer. Collecting these final answers into a list prepares for a voting mechanism in which the most frequently occurring answer is selected as the most reliable, on the assumption that multiple independent reasoning paths converging on the same answer increase confidence in its correctness.


In [5]:
# simple voting/selection mechanism
# count how many times each answer appears
print("\n" + "=" * 60)
print("Selecting best answer...")
print("=" * 60)

# extract final numbers from each candidate (simple approach)
final_answers = []
for candidate in candidates:
    # finds all whole numbers in the string
    # (sequences of digits surrounded by word boundaries)
    numbers = re.findall(r'\b\d+\b', candidate)
    # Take the last number as the answer; store None when no number is found
    # so that final_answers stays index-aligned with candidates
    final_answers.append(numbers[-1] if numbers else None)

print(f"\nAll answers: {final_answers}")


Selecting best answer...

All answers: ['16', '12', '24', '16', '16', '16', '12']
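The last-number heuristic is simple, but it picks the wrong value whenever a response mentions another number after the answer. One possible refinement, sketched here under the assumption that some responses label their result with "Answer:", is to prefer a labeled number and only fall back to the last number. `extract_final_answer` is a hypothetical helper, not part of the lab's code.

```python
import re
from typing import Optional

def extract_final_answer(text: str) -> Optional[str]:
    """Prefer a number that follows an explicit 'Answer:' label;
    otherwise fall back to the last whole number in the text."""
    labeled = re.findall(r'answer\s*[:=]\s*(\d+)', text, flags=re.IGNORECASE)
    if labeled:
        return labeled[-1]
    numbers = re.findall(r'\b\d+\b', text)
    return numbers[-1] if numbers else None

examples = [
    "Therefore, after all the changes, there are now 16 cats.",
    "Answer: 16. (Checked in 3 steps.)",
    "I am not sure.",
]
extracted = [extract_final_answer(text) for text in examples]
print(extracted)
```

Returning `None` for responses without any number keeps the list of answers the same length as the list of candidates, which matters when an index into one is later used to look up the other.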


### Finding the Most Common Answer

The final bit of code below uses Python's `Counter` class to tally how many times each numerical answer appears across all candidates, then identifies the most frequently occurring answer as the best solution. It displays this consensus answer along with how many candidates agreed on it, and prints out the full reasoning from the first candidate that arrived at this most common answer, providing both the final result and a complete solution path that led to it.


In [6]:
# find most common answer using collections.Counter
# to count the frequencies of the final answers (skipping candidates with no number)
answer_counts = Counter(ans for ans in final_answers if ans is not None)
best_answer, votes = answer_counts.most_common(1)[0]

print(f"\nMost common answer: {best_answer} (appeared {votes}/{N} times)")
print("\nBest candidate (most common answer):")
# Print the first candidate with the most common answer
for i, ans in enumerate(final_answers):
    if ans == best_answer:
        print(candidates[i])
        break


Most common answer: 16 (appeared 4/7 times)

Best candidate (most common answer):

Step 1: Start with 10 cats.
Step 2: Half of them run away, so 10 divided by 2 is 5. There are now 5 cats remaining.
Step 3: 3 of the cats return, so add 3 to the current number of cats (5) to get 5+3=8 cats.
Step 4: Half of the remaining cats have kittens, so half of 8 is 4. This means that 4 cats have kittens, and each cat has 2 kittens. 4 x 2 = 8 kittens.
Step 5: Add the number of kittens (8) to the current number of cats (8) to get 8+8=16 cats in total.
Therefore, after all the changes, there are now 16 cats. 
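The tally can be reproduced offline from the answers recorded above, without calling the model again. This sketch only replays the list printed in the previous step; the variable names are introduced here for illustration.

```python
from collections import Counter

# Final answers recorded in the run above
recorded_answers = ['16', '12', '24', '16', '16', '16', '12']

counts = Counter(recorded_answers)
winner, votes = counts.most_common(1)[0]
share = votes / len(recorded_answers)

print(counts.most_common())
print(f"{winner} wins with {votes}/{len(recorded_answers)} votes ({share:.0%})")
```

Note that `most_common` lists answers with equal counts in the order they were first encountered, so in a tie the earliest answer wins. If ties matter for your task, consider generating more candidates or adding a tie-breaking rule.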


### Best-of-N Experiment Results

As you can see above, the reasoning paths and even the final answers varied from one inference call to the next. By collecting N independent inference samples from the model and selecting the most consistent final answer, we're using more computational resources to improve the accuracy of our results.
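We can also check the consensus answer by hand. The sketch below follows the same reading of the problem as the winning candidate, which is an assumption about an ambiguously worded puzzle: the 3 returning cats rejoin the 5 that stayed, and kittens count as cats.

```python
cats = 10
cats -= cats // 2        # half run away       -> 5 remain
cats += 3                # 3 return            -> 8
mothers = cats // 2      # half have kittens   -> 4 mothers
kittens = mothers * 2    # 2 kittens each      -> 8 kittens
total = cats + kittens
print(total)
```

Under this reading the total agrees with the majority answer of 16, while the minority answers (12 and 24) come from candidates that counted differently at one of the steps.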


---

## Self-verification Strategy

Self-verification is when you ask the LLM to check its own work, the same way a student might review their test answers before turning them in. Both LLMs and students often make a mistake on the first attempt, then find and fix the error when they take the time to check their work.

Here's the process: 
1. **Generate an initial answer**: ask the model to solve a problem
2. **Verify the answer**: ask the model to review its own solution and check for errors
3. **Refine if needed**: based on the verification, produce a corrected final answer

The main idea is giving the model a chance to catch its own mistakes. This is another form of test-time compute. You're running the model multiple times on the same problem - once to solve it, once to verify it, and possibly once more to refine it. That's more computational work than a single-pass answer, but it often leads to more accurate results. You're trading compute for quality.


In [7]:
# set-up
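# `model=` selects the model (`name=` would only label the runnable); the model shown is an assumption
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0.3)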

problem = "A number is formed by reversing the digits of 27. Add this number to 27. What’s the sum?"

print("=" * 60)
print("Self-Verification Strategy")
print("=" * 60)
print(f"Problem: {problem}\n")

Self-Verification Strategy
Problem: A number is formed by reversing the digits of 27. Add this number to 27. What’s the sum?




### How the Model Gets Its Initial Response

The code sends a straightforward prompt to the language model using `llm.invoke()`, which combines the instruction "Solve this problem:" with the actual problem text. The model processes this prompt in a single pass and returns its first attempt at a solution without any self-reflection or verification - this initial answer may contain errors or incomplete reasoning that will be refined in later steps.


In [12]:
# getting the initial answer

print("STEP 1: Initial Answer")
print("-" * 60)

initial_prompt = f"Solve this problem:\n{problem}"
initial_answer = llm.invoke(initial_prompt)
print(initial_answer)

STEP 1: Initial Answer
------------------------------------------------------------

The number formed by reversing the digits of 27 is 72. Adding this number to 27 results in a sum of 99. Therefore, the sum is 99.


### Self-Verification Step

The verification prompt asks the model to act as its own critic by presenting the original problem alongside the initial answer and requesting a thorough review. The model checks the solution by examining the logical reasoning for flaws, verifying that any calculations are performed correctly, and ensuring the final answer is sensible in the context of the problem - essentially forcing the model to step back and evaluate its own work from a fresh perspective.


In [13]:
# using self-verification

print("STEP 2: Self-Verification")
print("-" * 60)

verification_prompt = f"""
Here's a problem and a proposed solution. Check if the solution is correct.
If you find any errors, explain what's wrong and provide the correct answer.

Problem: {problem}

Proposed Solution:
{initial_answer}

Is this solution correct? Verify by:
1. Checking the logic
2. Verifying the calculations
3. Confirming the final answer makes sense
"""

verification = llm.invoke(verification_prompt)
print(verification)

STEP 2: Self-Verification
------------------------------------------------------------

1. The logic of the solution is correct. The problem asks to reverse the digits of 27, which results in the number 72. Adding this number to 27 gives a sum of 99.
2. The calculations are also correct. 27 + 72 = 99.
3. The final answer of 99 makes sense as it is the sum of two numbers (27 and 72) that are formed by the same digits, but in reverse order. This is a common pattern in math and the answer aligns with this pattern. 


### Final Refinement Step

The refinement prompt provides the model with the complete context: the original problem, its initial attempt, and the verification feedback identifying any errors or confirming correctness. By synthesizing all three pieces of information, the model can incorporate the insights from the verification step to either correct mistakes found in the initial answer or confidently reaffirm the solution if it was already correct, producing a final refined response.


In [14]:
# getting the final refined answer

print("STEP 3: Final Refined Answer")
print("-" * 60)

refinement_prompt = f"""
Based on this verification, provide the final correct answer:

Original problem: {problem}

Initial attempt: {initial_answer}

Verification: {verification}

What is the final, correct answer?
"""

final_answer = llm.invoke(refinement_prompt)
print(final_answer)

STEP 3: Final Refined Answer
------------------------------------------------------------

The final, correct answer is 99.


### Self-verification Experiment Results

As we can see above, the second call to the LLM reviews the initial solution. In this run the initial answer (99) was already correct, so the verification step confirmed it. When the first attempt does contain a mistake, the same check gives the model a chance to find and correct it.

The verification prompt encourages more careful, step-by-step reasoning by breaking the review into explicit sub-questions (logic, calculations, and whether the final answer makes sense). This structured approach helps the model catch errors that it made during the initial attempt.
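The three steps can be wrapped into a single function. This is a minimal sketch, not the lab's reference implementation: `solve_verify_refine`, `TextModel`, and `CannedModel` are hypothetical names, and the canned replies are invented for illustration (they are not real model output). `CannedModel` exists only so the sketch runs without an API key; in the lab you would pass the `llm` object instead.

```python
from typing import Protocol

class TextModel(Protocol):
    """Anything with an invoke(prompt) -> str method, such as the lab's llm."""
    def invoke(self, prompt: str) -> str: ...

def solve_verify_refine(problem: str, llm: TextModel) -> dict:
    """Run the solve -> verify -> refine sequence and return all three outputs."""
    initial = llm.invoke(f"Solve this problem:\n{problem}")
    verification = llm.invoke(
        "Here's a problem and a proposed solution. Check if the solution is correct.\n"
        "If you find any errors, explain what's wrong and provide the correct answer.\n\n"
        f"Problem: {problem}\n\nProposed Solution:\n{initial}"
    )
    final = llm.invoke(
        "Based on this verification, provide the final correct answer:\n\n"
        f"Original problem: {problem}\n\nInitial attempt: {initial}\n\n"
        f"Verification: {verification}\n\nWhat is the final, correct answer?"
    )
    return {"initial": initial, "verification": verification, "final": final}

class CannedModel:
    """Offline stand-in that replays fixed replies (illustrative, invented text)."""
    def __init__(self, replies):
        self._replies = iter(replies)
        self.prompts = []  # records every prompt it receives

    def invoke(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return next(self._replies)

demo_llm = CannedModel([
    "The sum is 54.",
    "Incorrect: 27 reversed is 72, and 27 + 72 = 99.",
    "99",
])
result = solve_verify_refine("Reverse the digits of 27 and add the result to 27.", demo_llm)
print(result["final"])
```

The function always makes three calls, so it costs roughly three times a single-pass answer. A natural variation is to skip the refinement call when the verification reports no errors.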

---


## Exercises

Now let's extend what we've learned with some hands-on exercises: 


### Exercise 1 - Implementing Chain-of-Thought Prompting 

In the lesson, we saw how Chain-of-Thought prompting helps models reason through problems step-by-step. Now it's your turn to implement a CoT prompt from scratch, a useful skill given how often we use LLMs as tools.

**Instructions**: Complete the code below to implement a Chain-of-Thought prompting function. The function should guide the model through a structured reasoning process for math word problems. 


In [None]:
def chain_of_thought_solve(problem: str, llm) -> str:
    """
    Solve a problem using Chain-of-Thought prompting.
    
    Args:
        problem (str): The problem to solve
        llm: The language model instance
        
    Returns:
        str: The model's response with step-by-step reasoning
    """
    # TODO 1: Create a prompt that asks the model to solve the problem step-by-step
    # Your prompt should explicitly instruct the model to:
    # - Break down the problem
    # - Show each calculation step
    # - Provide a final answer
    
    cot_prompt = """
    # YOUR CODE HERE
    """
    
    # TODO 2: Invoke the LLM with your Chain-of-Thought prompt
    response = # YOUR CODE HERE
    
    return response

# Test your implementation
problem = "Sarah has 3 boxes. Each box contains 4 bags. Each bag has 5 marbles. How many marbles does Sarah have in total?"
result = chain_of_thought_solve(problem, llm)
print(result)

<details>
    <summary>Click here for Solution</summary>
    
```python
def chain_of_thought_solve(problem: str, llm) -> str:
    """
    Solve a problem using Chain-of-Thought prompting.
    Args:
    problem (str): The problem to solve
    llm: The language model instance
    
    Returns:
    str: The model's response with step-by-step reasoning
    """
    cot_prompt = f"""
    Solve the following problem step-by-step. Show your reasoning at each step.

    Problem: {problem}

    Please:
    1. Break down what information is given
    2. Identify what needs to be calculated
    3. Show each calculation step with explanation
    4. Provide the final answer

    Let's think through this step by step:
    """

    response = llm.invoke(cot_prompt)

    return response
```

</details>


### Exercise 2 - Best-of-N Sampling with Custom Selection 

In this exercise, you'll implement a Best-of-N sampling strategy with a custom scoring function to select the best answer. 

**Instructions**: Complete the function below to generate N candidate solutions and select the best one based on a scoring criterion. You'll need to implement both the generation loop and the selection logic.


In [None]:
def best_of_n_sample(problem: str, llm, n: int = 3, temperature: float = 0.7) -> dict:
    """
    Generate N solutions and return the best one based on confidence scoring.
    
    Args:
        problem (str): The problem to solve
        llm: The language model instance
        n (int): Number of candidate solutions to generate
        temperature (float): Temperature for sampling diversity
        
    Returns:
        dict: Dictionary with 'best_answer', 'all_answers', and 'scores'
    """
    # TODO 1: Create a prompt that asks the model to solve the problem
    # and rate its confidence (0-10) in the answer
    prompt = f"""
    # YOUR CODE HERE - Create a prompt that:
    # 1. Asks to solve: {problem}
    # 2. Requests a confidence score (0-10)
    # 3. Uses a format like "Answer: X\nConfidence: Y"
    """
    
    candidates = []
    
    # TODO 2: Generate N candidate solutions
    # Hint: You'll need to call llm.invoke() N times
    for i in range(n):
        # YOUR CODE HERE
        pass
    
    # TODO 3: Parse confidence scores from each candidate
    # Extract the confidence score from responses (look for "Confidence: X")
    scores = []
    for candidate in candidates:
        # YOUR CODE HERE - extract confidence score
        # Hint: You might use string methods like .split() or regex
        pass
    
    # TODO 4: Select the candidate with the highest confidence score
    best_idx = # YOUR CODE HERE
    
    return {
        'best_answer': candidates[best_idx],
        'all_answers': candidates,
        'scores': scores
    }

# Test your implementation
problem = "What is 15% of 240?"
result = best_of_n_sample(problem, llm, n=3)
print("Best Answer:", result['best_answer'])
print("\nAll Scores:", result['scores'])

<details>
    <summary>Click here for Solution</summary>
    
```python
def best_of_n_sample(problem: str, llm, n: int = 3, temperature: float = 0.7) -> dict:
    """
    Generate N solutions and return the best one based on confidence scoring.

    Args:
    problem (str): The problem to solve
    llm: The language model instance
    n (int): Number of candidate solutions to generate
    temperature (float): Temperature for sampling diversity
    
    Returns:
        dict: Dictionary with 'best_answer', 'all_answers', and 'scores'
    """
    prompt = f"""
    Solve this problem: {problem}

    After providing your answer, rate your confidence in the solution on a scale of 0-10.

    Format your response as:
    Answer: [your solution]
    Confidence: [0-10]
    """

    candidates = []

    # Generate N candidate solutions
    for i in range(n):
        response = llm.invoke(prompt)
        candidates.append(response)

    # Parse confidence scores from each candidate
    scores = []
    for candidate in candidates:
        try:
            # Extract confidence score
            confidence_line = [line for line in candidate.split('\n') if 'Confidence:' in line][0]
            score = float(confidence_line.split(':')[1].strip())
            scores.append(score)
        except (IndexError, ValueError):
            scores.append(0)  # Default score if parsing fails

    # Select the candidate with the highest confidence score
    best_idx = scores.index(max(scores))

    return {
        'best_answer': candidates[best_idx],
        'all_answers': candidates,
        'scores': scores
    }
```

</details>


---

## Summary 

Congratulations! You’ve explored the powerful concept of **Test-Time Compute (TTC)** and how it enhances reasoning and accuracy in large language models during inference. In this lab, you learned how to:

* **Explain Test-Time Compute**: Understand how allocating more compute at inference can improve reasoning quality.
* **Apply Chain-of-Thought Prompting**: Enable models to reason step-by-step through complex problems.
* **Use Best-of-N Sampling**: Generate multiple candidate responses and select the most accurate or consistent one.
* **Implement Self-Verification**: Allow models to review, critique, and refine their own outputs.
* **Analyze Performance Trade-offs**: Evaluate how increasing test-time compute impacts accuracy, consistency, and efficiency.

### Next Steps:

There's still a lot we can improve upon! If you're up for a challenge, take a look at the next steps below to continue learning:

* Experiment with **search-based reasoning** (e.g., Tree of Thoughts) to explore multiple reasoning paths.
* Integrate **self-consistency scoring** or **ensemble reasoning** to further refine outputs.
* Visualize reasoning traces to understand how the model’s internal thought process evolves with more compute.
* Compare efficiency metrics (latency vs. accuracy) to find the optimal compute budget for your task.


![Congrats on completing this project image](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/vg_icJK9Wf6TUzkeM0FZfw/IMG-0640.jpg)


## Authors


[Tenzin Migmar](https://author.skills.network/instructors/tenzin_migmar): Hi, I'm Tenzin. I'm a data scientist intern at IBM interested in applying machine learning to solve difficult problems. Prior to joining IBM, I worked as a research assistant on projects exploring perspectivism and personalization within large language models. In my free time, I enjoy recreational programming and learning to cook new recipes.


### Other Contributors


[Abdul Fatir](https://author.skills.network/instructors/abdul_fatir): Abdul specializes in Data Science, Machine Learning, and AI. He has deep expertise in understanding how the latest technologies work, and their applications. Feel free to contact him with questions about this project or any other AI/ML topics.


Copyright © 2025 IBM Corporation. All rights reserved.
