# Scaling Test-Time Compute for Longer Thinking in LLMs

_Authored by: [Sergio Paniego](https://github.com/sergiopaniego)_

🚨 **WARNING**: This notebook is **resource-intensive** and requires substantial computational power. If you’re running this in **Colab**, you will need an **A100 GPU**.

---

In this recipe, we'll guide you through extending the inference time for an **Instruct LLM system** using **test-time compute** to solve more challenging problems, such as **complex math problems**. This approach, inspired by [**OpenAI o1-o3 models**](https://openai.com/index/learning-to-reason-with-llms/), demonstrates that **longer reasoning time** during inference can enhance model performance.

This technique builds on experiments shared in [this **blog post**](https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute), which show that smaller models, like the **1B** and **3B Llama Instruct models**, can outperform much larger ones on the **MATH-500 benchmark** when given enough **"time to think"**. Recent research from [DeepMind](https://arxiv.org/abs/2408.03314) suggests that **test-time compute** can be scaled optimally through strategies like iterative self-refinement or using a reward model.

The blog introduces a [**new repository**](https://github.com/huggingface/search-and-learn) for running these experiments. In this recipe, we'll focus on building a **small chatbot** that engages in **longer reasoning** to tackle **harder problems** using small open models.

![Instruct LLM Methodology](https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-thumbnail.png)

## 1. Install Dependencies

Let’s start by installing the [search-and-learn](https://github.com/huggingface/search-and-learn) repository! 🚀  
This repo is designed to replicate the experimental results and is not published as a pip package. However, we can still use it to build our system. To do so, we’ll need to install it from source with the following steps:

In [None]:
!git clone https://github.com/huggingface/search-and-learn

In [None]:
%cd search-and-learn

In [None]:
!pip install -e '.[dev]'
!pip install matplotlib

Log in to Hugging Face to access [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), as it is a gated model! 🗝️  
If you haven't previously requested access, you'll need to submit a request before proceeding.


In [2]:
from huggingface_hub import notebook_login

notebook_login()


## 2. Set Up the Large Language Model (LLM) and the Process Reward Model (PRM) 💬

As illustrated in the diagram, the system consists of an LLM that generates intermediate answers based on user input, a [PRM model](https://huggingface.co/papers/2211.14275) that evaluates and scores these answers, and a search strategy that uses the PRM feedback to guide the subsequent steps in the search process until reaching the final answer.

Let’s begin by initializing each model. For the LLM, we’ll use the [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model, and for the PRM, we’ll use the [RLHFlow/Llama3.1-8B-PRM-Deepseek-Data](https://huggingface.co/RLHFlow/Llama3.1-8B-PRM-Deepseek-Data) model.




![system](https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/system.png)

In [None]:
import sys
import os

project_src = "src/"

# Add it to sys.path
sys.path.append(project_src)

In [None]:
import torch
from vllm import LLM
from sal.models.reward_models import RLHFFlow

model_path="meta-llama/Llama-3.2-1B-Instruct"
prm_path="RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.5,  # Utilize 50% of GPU memory
    enable_prefix_caching=True,  # Optimize repeated prefix computations
    seed=42,                     # Set seed for reproducibility
)

prm = RLHFFlow(prm_path)
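
The LLM and the PRM have to share one GPU, which is presumably why the vLLM engine is capped at half of the GPU memory. As a rough sanity check, the sketch below estimates how much memory the weights alone need in bfloat16. The parameter counts are approximate figures used for illustration, and the estimate ignores the KV cache, activations, and CUDA graphs.

```python
# Back-of-the-envelope estimate of weight memory (illustrative only).
# Parameter counts are approximate assumptions, not values read from the checkpoints.
BYTES_PER_PARAM_BF16 = 2


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


approx_params = {
    "Llama-3.2-1B-Instruct (LLM)": 1.24e9,
    "Llama3.1-8B-PRM-Deepseek-Data (PRM)": 8.03e9,
}

for name, num_params in approx_params.items():
    print(f"{name}: ~{weight_memory_gib(num_params):.1f} GiB of weights")
```

The 8B estimate lands close to the `Loading model weights took 14.9888 GB` line that vLLM logs in the benchmarking section when it loads an 8B model.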

### 2.1 Instantiate the Question, Search Strategy, and Call the Pipeline

Now that we've set up the LLM and PRM, let's define the question, select a search strategy for exploring candidate solutions, and call the pipeline to process the question through both models.

1. **Instantiate the Question**: In this step, we define the input question that the system will answer.

2. **Search Strategy**: The system currently supports the following search strategies: `best_of_n`, `beam_search`, and `dvts` (see diagram). For this example, we'll use `best_of_n`, but you can easily switch to any of the other strategies based on your needs. Each strategy is controlled through a set of configuration parameters. You can check the full list [here](https://github.com/huggingface/search-and-learn/blob/main/src/sal/config.py).

3. **Call the Pipeline**: With the question and search strategy in place, we’ll call the inference pipeline, processing the inputs through both the LLM and PRM to generate the final answer.

![](https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/search-strategies.png)
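
To build some intuition for how a step-level search differs from simply sampling full answers, here is a minimal toy sketch of beam search. It is not the `search-and-learn` implementation: `toy_beam_search`, `expand`, and `score` are hypothetical names, and the "reasoning steps" are just integers. In the real pipeline the LLM would propose the next steps and the PRM would score the partial solutions.

```python
from typing import Callable, List

Path = List[int]


def toy_beam_search(
    expand: Callable[[Path], List[int]],
    score: Callable[[Path], float],
    num_beams: int = 2,
    num_iterations: int = 3,
) -> Path:
    """Grow partial solutions step by step, keeping only the best `num_beams` each round."""
    beams: List[Path] = [[]]
    for _ in range(num_iterations):
        # Extend every surviving partial solution with every proposed next step
        candidates = [path + [step] for path in beams for step in expand(path)]
        # Rank the extended paths with the (toy) verifier and prune the rest
        candidates.sort(key=score, reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


# Toy task: pick three steps from {1, 2, 3} whose sum is as close to 7 as possible.
TARGET = 7
best_path = toy_beam_search(
    expand=lambda path: [1, 2, 3],
    score=lambda path: -abs(TARGET - sum(path)),
)
print(best_path, sum(best_path))
```

The key design point is the pruning inside the loop: weak partial solutions are dropped early, so compute is concentrated on the most promising branches instead of being spent on completing every candidate.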

The first step is to clearly define the question that the system will answer. This ensures that we have a precise task for the model to tackle.

In [None]:
# Raw string so that LaTeX commands such as \theta are not read as escape sequences (\t is a tab)
question_text = r'Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$'
input_batch = {"problem": [question_text]}

Next, we define the configuration, including parameters such as the number of candidate answers (`n`), and choose the search strategy. The search strategy dictates how we explore the potential answers. In this case, we'll use `best_of_n`.

With the question and configuration in place, we use the selected search strategy to generate multiple candidate answers. The PRM scores these candidates, and the highest-scoring one is returned as the final answer.


In [None]:
from sal.config import Config
from sal.search import beam_search, best_of_n, dvts

config = Config()
config.n=32 # Number of answers to generate during the search

search_result = best_of_n(x=input_batch, config=config, llm=llm, prm=prm)
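
Conceptually, best-of-N boils down to "score every candidate, keep the best one". The sketch below illustrates that selection step on made-up data. The per-step scores, the `aggregate` helper, and the strategy names are hypothetical and chosen for illustration; they are not taken from the library, whose actual options live in its `Config`.

```python
import math
from typing import Dict, List


def aggregate(step_scores: List[float], strategy: str = "last") -> float:
    """Collapse per-step verifier scores into a single score for the whole answer."""
    if strategy == "last":
        return step_scores[-1]
    if strategy == "min":
        return min(step_scores)
    if strategy == "prod":
        return math.prod(step_scores)
    raise ValueError(f"Unknown aggregation strategy: {strategy!r}")


def pick_best(candidates: Dict[str, List[float]], strategy: str = "last") -> str:
    """Return the candidate answer with the highest aggregated score."""
    return max(candidates, key=lambda answer: aggregate(candidates[answer], strategy))


# Made-up candidates for the polar-coordinates question, with made-up step scores.
candidates = {
    "(3, pi/2)": [0.90, 0.80, 0.95],
    "(3, 0)": [0.90, 0.40, 0.20],
    "(0, 3)": [0.30, 0.20, 0.10],
}

for strategy in ("last", "min", "prod"):
    print(strategy, "->", pick_best(candidates, strategy))
```

Because a PRM scores intermediate steps rather than only the final answer, the choice of aggregation matters: `min` penalizes a single weak step, while `last` only looks at how the solution ends.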

### 2.2 Display the Final Result

Once the pipeline has processed the question through the LLM and PRM, we can display the final result. This result will be the model's output after considering the intermediate answers and scoring them using the PRM.

Here's how to display the final answer:

In [None]:
search_result['pred'][0]

The model’s output might include special tokens, such as `<|start_header_id|>` or `<|end_header_id|>`. To make the answer more readable, we can safely remove them before displaying it to the end user.

In [None]:
formatted_output = search_result['pred'][0].replace("<|start_header_id|>assistant<|end_header_id|>\n\n", "").strip()
formatted_output
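
The `replace` call above only handles one exact prefix. If other special tokens show up in the output, a slightly more general cleanup could look like the following sketch. `strip_special_tokens` is a hypothetical helper, and the token list is an assumption based on the Llama 3 chat format rather than an exhaustive set.

```python
import re

# Matches a whole role header such as <|start_header_id|>assistant<|end_header_id|>
HEADER_PATTERN = re.compile(r"<\|start_header_id\|>.*?<\|end_header_id\|>\s*", flags=re.DOTALL)
# A few other Llama 3 style markers (assumed list, extend as needed)
OTHER_SPECIAL_TOKENS = re.compile(r"<\|(?:begin_of_text|end_of_text|eot_id)\|>")


def strip_special_tokens(text: str) -> str:
    """Remove chat-template markers so only the readable answer remains."""
    text = HEADER_PATTERN.sub("", text)
    text = OTHER_SPECIAL_TOKENS.sub("", text)
    return text.strip()


raw_output = (
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "## Step 1\nThe point lies on the positive y-axis.<|eot_id|>"
)
print(strip_special_tokens(raw_output))
```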

After removing any special tokens, we can display the final answer to the user. Since the answer is formatted as Markdown, we can render it as such.

In [None]:
from IPython.display import display, Markdown

display(Markdown(formatted_output))

## 3. Assembling It All! 🧑‍🏭️

Now, let's create a method that encapsulates the entire pipeline. This will allow us to easily reuse the process in future applications, making it efficient and modular.

By combining the LLM, PRM, search strategy, and result display, we can simplify the workflow and ensure that it’s reusable for other tasks or questions. Additionally, we’ll track the time spent on each method so that we can **understand the practical implications** of using each strategy and configuration.

Here’s how we can structure the method:

In [None]:
import time

def generate_with_search_and_learn(question, config, llm, prm, method='best_of_n'):
    """
    Generate an answer for a given question using the search-and-learn pipeline.

    Args:
    - question (str): The input question to generate an answer for.
    - config (Config): Configuration object containing parameters for search strategy.
    - llm (LLM): Pretrained large language model used for generating answers.
    - prm (RLHFFlow): Process reward model used for evaluating answers.
    - method (str): Search strategy to use. Options are 'best_of_n', 'beam_search', 'dvts'. Default is 'best_of_n'.

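    Returns:
    - tuple: (formatted_output, elapsed_time, total_tokens), i.e. the cleaned answer (str),
      the search time in seconds (float), and the token count across all completions (int).
    """
    batch = {"problem": [question]}

    start_time = time.time()
    if method == 'best_of_n':
        result = best_of_n(x=batch, config=config, llm=llm, prm=prm)
    elif method == 'beam_search':
        result = beam_search(examples=batch, config=config, llm=llm, prm=prm)
    elif method == 'dvts':
        result = dvts(examples=batch, config=config, llm=llm, prm=prm)
    else:
        # Fail early instead of hitting an undefined `result` below
        raise ValueError(f"Unknown search method: {method!r}")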

    elapsed_time = time.time() - start_time
    print(f"\nFinished in {elapsed_time:.2f} seconds\n")

    tokenizer = llm.get_tokenizer()
    total_tokens = 0
    for completion in result['completions']:
        for comp in  completion:
            output_tokens = tokenizer.encode(comp)
            total_tokens += len(output_tokens)

    print(f"Total tokens in all completions: {total_tokens}")

    formatted_output = result['pred'][0].replace("<|start_header_id|>assistant<|end_header_id|>\n\n", "").strip()
    return formatted_output, elapsed_time, total_tokens

### ⏳  3.1 Comparing Thinking Time for Each Strategy

Let’s compare the **thinking time** of three methods: `best_of_n`, `beam_search`, and `dvts`. Each method is evaluated using the same number of answers during the search process, measuring the time spent thinking in seconds and the number of generated tokens.

In the results below, `best_of_n` has the shortest thinking time and `beam_search` the longest, with `dvts` in between. However, `best_of_n` generates the most tokens due to its simpler search strategy, which completes every candidate answer in full instead of pruning partial ones.

| **Method**      | **Number of Answers During Search** | **Thinking Time (Seconds)** | **Generated Tokens** |
|------------------|-------------------------------------|-----------------------------|-----------------------|
| **best_of_n**    | 8                                   | 3.54                        | 3087                  |
| **beam_search**  | 8                                   | 10.06                       | 2049                  |
| **dvts**         | 8                                   | 8.46                        | 2544                  |

This comparison illustrates the trade-offs between the strategies, balancing time spent thinking and the complexity of the search process.
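
One way to read the table is in terms of throughput, that is, generated tokens per second of thinking time. The short sketch below derives it from the numbers reported above; the variable names are only illustrative.

```python
# (thinking time in seconds, generated tokens), copied from the table above
results = {
    "best_of_n": (3.54, 3087),
    "beam_search": (10.06, 2049),
    "dvts": (8.46, 2544),
}

# Tokens generated per second of thinking time
throughput = {
    method: tokens / seconds for method, (seconds, tokens) in results.items()
}

for method, tokens_per_second in throughput.items():
    print(f"{method:<12} {tokens_per_second:7.1f} tokens/s")
```

`best_of_n` comes out well ahead on this metric, which is consistent with it generating all candidates in one batch, whereas the tree-based strategies alternate between generating and scoring at each step.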


#### 1. **Best of n**

We’ll begin by using the `best_of_n` strategy. Here’s how to track the thinking time for this method:

In [None]:
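# Raw string so that LaTeX commands such as \theta are not read as escape sequences (\t is a tab)
question = r'Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$'

config.n=8

# The helper returns (answer, elapsed seconds, total tokens)
formatted_output, elapsed_time, total_tokens = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='best_of_n')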

In [None]:
display(Markdown(formatted_output))

#### 2. **Beam Search**

Now, let's try using the `beam_search` strategy.

In [None]:
config.n=8
# beam search specific
config.sort_completed=True
config.filter_duplicates=True

# The helper returns (answer, elapsed seconds, total tokens)
formatted_output, elapsed_time, total_tokens = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='beam_search')

In [None]:
display(Markdown(formatted_output))

#### 3. **Diverse Verifier Tree Search (DVTS)**

Finally, let's try the `dvts` strategy.

In [None]:
config.n=8
# dvts specific
config.n_beams = config.n // config.beam_width

# The helper returns (answer, elapsed seconds, total tokens)
formatted_output, elapsed_time, total_tokens = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='dvts')

In [None]:
display(Markdown(formatted_output))
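
The `config.n_beams = config.n // config.beam_width` line above splits the total budget of `n` answers into independent subtrees. The sketch below spells out that arithmetic. `dvts_subtrees` is a hypothetical helper, and the `beam_width` of 4 is an assumed example value, so check `Config` for the value your installation actually uses.

```python
def dvts_subtrees(n: int, beam_width: int) -> int:
    """Number of independent subtrees when n answers are split into groups of beam_width."""
    if beam_width <= 0:
        raise ValueError("beam_width must be positive")
    if n % beam_width != 0:
        # Integer division would silently drop part of the budget
        raise ValueError(f"n={n} is not a multiple of beam_width={beam_width}")
    return n // beam_width


# beam_width=4 is an assumed example value
for n in (4, 8, 16, 32):
    print(f"n={n:>2} -> {dvts_subtrees(n, beam_width=4)} subtree(s)")
```

With `n=8` and a width of 4 this gives 2 subtrees, each explored separately, which is where the diversity in DVTS comes from.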

### 🙋 3.2 Testing the System with a Simple Question

In this final example, we’ll test the system using a straightforward question to observe how it performs in simpler cases. This allows us to verify that the system works as expected even for basic queries.

Let's try the following question:

In [None]:
question = "What's the capital of Spain?"

config.n=32

# The helper returns (answer, elapsed seconds, total tokens)
formatted_output, elapsed_time, total_tokens = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='best_of_n')

In [None]:
display(Markdown(formatted_output))

Even though we set a larger number of candidate answers (`N`), the time spent thinking remains relatively small (1.03 seconds and 544 generated tokens). This demonstrates the system’s ability to efficiently handle easier problems, spending less time on them, while leveraging its enhanced capabilities for more complex questions.

🏆 **We now have a fully operational pipeline** that leverages test-time compute, enabling the system to "think longer" for more complicated queries, while also maintaining fast response times for straightforward questions.

This approach ensures the system can scale its thinking time based on the task's complexity, offering an efficient and responsive solution for both simple and challenging problems.


## 4. Continuing the Journey and Resources 🧑‍🎓️

If you're eager to continue exploring, be sure to check out the original experimental [blog](https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute) and all the references mentioned within it. These resources will deepen your understanding of test-time compute, its benefits, and its applications in LLMs.


Happy learning and experimenting! 🚀

## 5. Benchmarking Script

In [3]:
import torch
from vllm import LLM
from sal.models.reward_models import RLHFFlow
from sal.search import beam_search, best_of_n, dvts
import sys
import os

project_src = "src/"

model_path="meta-llama/Llama-3.1-8B-Instruct"
prm_path="RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.5,  # Utilize 50% of GPU memory
    enable_prefix_caching=True,  # Optimize repeated prefix computations
    seed=42,                     # Set seed for reproducibility
)

prm = RLHFFlow(prm_path)



No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION


INFO 02-12 17:13:14 config.py:1005] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 02-12 17:13:14 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=meta-llama/

INFO 02-12 17:13:17 model_runner.py:1060] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 02-12 17:13:17 weight_utils.py:243] Using model weights format ['*.safetensors']


INFO 02-12 17:13:40 model_runner.py:1071] Loading model weights took 14.9888 GB
INFO 02-12 17:13:40 gpu_executor.py:122] # GPU blocks: 11903, # CPU blocks: 2048
INFO 02-12 17:13:40 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 1.45x
INFO 02-12 17:13:42 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-12 17:13:42 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-12 17:13:53 model_runner.py:1530] Graph capturing finished in 11 secs.


In [4]:
import time

def generate_with_search_and_learn(question, config, llm, prm, method='best_of_n'):
    """
    Generate an answer for a given question using the search-and-learn pipeline.

    Args:
    - question (str): The input question to generate an answer for.
    - config (Config): Configuration object containing parameters for search strategy.
    - llm (LLM): Pretrained large language model used for generating answers.
    - prm (RLHFFlow): Process reward model used for evaluating answers.
    - method (str): Search strategy to use. Options are 'best_of_n', 'beam_search', 'dvts'. Default is 'best_of_n'.

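    Returns:
    - tuple: (formatted_output, elapsed_time, total_tokens), i.e. the cleaned answer (str),
      the search time in seconds (float), and the token count across all completions (int).
    """
    batch = {"problem": [question]}

    start_time = time.time()
    if method == 'best_of_n':
        result = best_of_n(x=batch, config=config, llm=llm, prm=prm)
    elif method == 'beam_search':
        result = beam_search(examples=batch, config=config, llm=llm, prm=prm)
    elif method == 'dvts':
        result = dvts(examples=batch, config=config, llm=llm, prm=prm)
    else:
        # Fail early instead of hitting an undefined `result` below
        raise ValueError(f"Unknown search method: {method!r}")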

    elapsed_time = time.time() - start_time
    print(f"\nFinished in {elapsed_time:.2f} seconds\n")

    tokenizer = llm.get_tokenizer()
    total_tokens = 0
    for completion in result['completions']:
        for comp in  completion:
            output_tokens = tokenizer.encode(comp)
            total_tokens += len(output_tokens)

    print(f"Total tokens in all completions: {total_tokens}")

    formatted_output = result['pred'][0].replace("<|start_header_id|>assistant<|end_header_id|>\n\n", "").strip()
    return formatted_output, elapsed_time, total_tokens

In [None]:
import csv
from sal.config import Config

# Add it to sys.path
sys.path.append(project_src)

n_values = [2**i for i in range(2, 9)]
print(n_values)

# Define CSV filename
csv_filename = "search_methods_results.csv"

# Define methods
methods = ["Best-of-n", "Beam search", "Diverse verifier tree search"]

# Define headers (Each method has its own Time (s) and Total tokens)
headers = ["Number of generations"]
for method in methods:
    headers.append(f"{method} Time (s)")
    headers.append(f"{method} Total tokens")

# Raw string so that LaTeX commands such as \theta are not read as escape sequences (\t is a tab)
question = r'Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$'

with open(csv_filename, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)  # Write header

    for i in n_values:
        config = Config()
        config.n = i

        row = [i]
        print(i)
    
        # best of n
        _, elapsed_time_bon, token_number_bon = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='best_of_n')
        row.append(elapsed_time_bon)
        row.append(token_number_bon)
    
        # beam search 
        config.sort_completed=True
        config.filter_duplicates=True
        _, elapsed_time_beam, token_number_beam = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='beam_search')
        row.append(elapsed_time_beam)
        row.append(token_number_beam)
    
        # dvts
        config.n_beams = config.n // config.beam_width
        _, elapsed_time_dvts, token_number_dvts = generate_with_search_and_learn(question=question, config=config, llm=llm, prm=prm, method='dvts')
        row.append(elapsed_time_dvts)
        row.append(token_number_dvts)

        writer.writerow(row)

    print(f"CSV file '{csv_filename}' has been created successfully.")

[4, 8, 16, 32, 64, 128, 256]
4

Finished in 3.84 seconds

Total tokens in all completions: 934


Beam search iterations:  10%|██████▏                                                       | 4/40 [00:08<01:15,  2.11s/it]



Finished in 8.43 seconds

Total tokens in all completions: 984


Beam search iterations:  10%|██████▏                                                       | 4/40 [00:05<00:46,  1.30s/it]



Finished in 5.22 seconds

Total tokens in all completions: 676
8

Finished in 3.95 seconds

Total tokens in all completions: 1759


Beam search iterations:  12%|███████▊                                                      | 5/40 [00:13<01:33,  2.66s/it]



Finished in 13.33 seconds

Total tokens in all completions: 1622


Beam search iterations:  10%|██████▏                                                       | 4/40 [00:07<01:10,  1.96s/it]



Finished in 7.84 seconds

Total tokens in all completions: 1492
16

Finished in 6.22 seconds

Total tokens in all completions: 3885


Beam search iterations:  20%|████████████▍                                                 | 8/40 [00:21<01:25,  2.66s/it]



Finished in 21.29 seconds

Total tokens in all completions: 2772


Beam search iterations:   8%|████▋                                                         | 3/40 [00:08<01:45,  2.84s/it]



Finished in 8.52 seconds

Total tokens in all completions: 2933
32

Finished in 7.25 seconds

Total tokens in all completions: 8056


Beam search iterations:  15%|█████████▎                                                    | 6/40 [00:25<02:26,  4.31s/it]



Finished in 25.84 seconds

Total tokens in all completions: 8519


Beam search iterations:  12%|███████▊                                                      | 5/40 [00:17<02:01,  3.47s/it]



Finished in 17.34 seconds

Total tokens in all completions: 5601
64

Finished in 9.62 seconds

Total tokens in all completions: 15138


Beam search iterations:  98%|███████████████████████████████████████████████████████████▍ | 39/40 [06:15<00:09,  9.62s/it]



Finished in 375.09 seconds

Total tokens in all completions: 15586


Beam search iterations:  12%|███████▊                                                      | 5/40 [00:24<02:49,  4.85s/it]



Finished in 24.25 seconds

Total tokens in all completions: 11232
128

Finished in 16.04 seconds

Total tokens in all completions: 29355


Beam search iterations:  15%|█████████▎                                                    | 6/40 [01:34<08:55, 15.74s/it]



Finished in 94.46 seconds

Total tokens in all completions: 28882


Beam search iterations:  15%|█████████▎                                                    | 6/40 [00:48<04:35,  8.10s/it]



Finished in 48.62 seconds

Total tokens in all completions: 25227
256

Finished in 28.64 seconds

Total tokens in all completions: 61132


Beam search iterations:  30%|██████████████████▎                                          | 12/40 [04:31<10:18, 22.09s/it]
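
Once the CSV exists, it can be inspected without any plotting. The sketch below uses only the standard library and an in-memory copy of the first three rows, with the values taken from the log output above and the header names from the benchmarking cell. It computes how much slower beam search was than best-of-n for each budget. Reading from `io.StringIO` instead of the real file is a simplification for illustration.

```python
import csv
import io

# First three benchmark rows, copied from the log output above
csv_text = """Number of generations,Best-of-n Time (s),Best-of-n Total tokens,Beam search Time (s),Beam search Total tokens,Diverse verifier tree search Time (s),Diverse verifier tree search Total tokens
4,3.84,934,8.43,984,5.22,676
8,3.95,1759,13.33,1622,7.84,1492
16,6.22,3885,21.29,2772,8.52,2933
"""

slowdown = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    n = int(row["Number of generations"])
    # Ratio of beam search time to best-of-n time for the same budget
    slowdown[n] = float(row["Beam search Time (s)"]) / float(row["Best-of-n Time (s)"])

for n, ratio in slowdown.items():
    print(f"n={n:>2}: beam search took {ratio:.1f}x as long as best-of-n")
```

These figures come from a single question and a single run, so they are indicative rather than a rigorous benchmark.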

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load CSV file
csv_filename = "search_methods_results.csv"  # Update this with the actual CSV file path
df = pd.read_csv(csv_filename)

# Extract relevant columns
x = df["Number of generations"]
best_of_n_time = df["Best-of-n Time (s)"]
beam_search_time = df["Beam search Time (s)"]
diverse_tree_time = df["Diverse verifier tree search Time (s)"]

# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(x, best_of_n_time, marker='o', label="Best-of-n")
plt.plot(x, beam_search_time, marker='s', label="Beam search")
plt.plot(x, diverse_tree_time, marker='^', label="Diverse verifier tree search")

# Labels and title
plt.xlabel("Number of Generations")
plt.ylabel("Time (s)")

plt.suptitle("Elapsed Time in All Completions vs. Number of Generations", fontsize=14)
# The subtitle names the 3B model; update it if the CSV was produced with a different LLM
plt.title("LLM: meta-llama/Llama-3.2-3B-Instruct, PRM: RLHFlow/Llama3.1-8B-PRM-Deepseek-Data", fontsize=10, color='gray')

plt.legend()

# Set log scale
plt.xscale("log")
plt.yscale("log")

# Ensure x-axis ticks show as integers
plt.xticks(x, labels=[str(int(val)) for val in x])  

# Grid and formatting
plt.grid(True, which="both", linestyle="--", linewidth=0.5)

# Show the plot
plt.show()
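
On a log-log plot, a straight line with slope `k` means that time grows roughly like `n**k`. As a rough complement to the figure, the sketch below fits that slope for best-of-n with a plain least-squares fit in log space, using the timings printed in the benchmark output above. `loglog_slope` is a hypothetical helper, and with only seven points from one run the result is an estimate at best.

```python
import math
from typing import Sequence

# Best-of-n timings copied from the benchmark output above
n_values = [4, 8, 16, 32, 64, 128, 256]
best_of_n_seconds = [3.84, 3.95, 6.22, 7.25, 9.62, 16.04, 28.64]


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    return covariance / variance


slope = loglog_slope(n_values, best_of_n_seconds)
print(f"Estimated exponent for best-of-n: {slope:.2f}")
```

A slope below 1 indicates sublinear growth in wall-clock time, which would be consistent with batched generation amortizing part of the cost as `n` increases.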


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load CSV file
csv_filename = "search_methods_results.csv"  # Update this with the actual CSV file path
df = pd.read_csv(csv_filename)

# Extract relevant columns
x = df["Number of generations"]
best_of_n_tokens = df["Best-of-n Total tokens"]
beam_search_tokens = df["Beam search Total tokens"]
diverse_tree_tokens = df["Diverse verifier tree search Total tokens"]

# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(x, best_of_n_tokens, marker='o', label="Best-of-n")
plt.plot(x, beam_search_tokens, marker='s', label="Beam search")
plt.plot(x, diverse_tree_tokens, marker='^', label="Diverse verifier tree search")

# Labels and title
plt.xlabel("Number of Generations")
plt.ylabel("Total Tokens")
plt.suptitle("Total Tokens in All Completions vs. Number of Generations", fontsize=14)
# The subtitle names the 3B model; update it if the CSV was produced with a different LLM
plt.title("LLM: meta-llama/Llama-3.2-3B-Instruct, PRM: RLHFlow/Llama3.1-8B-PRM-Deepseek-Data", fontsize=10, color='gray')
plt.legend()

# Set log scale for better visualization
plt.xscale("log")
plt.yscale("log")

# Ensure x-axis ticks show as integers
plt.xticks(x, labels=[str(int(val)) for val in x])  

# Grid and formatting
plt.grid(True, which="both", linestyle="--", linewidth=0.5)

# Show the plot
plt.show()

Compare Llama 3.2 1B, Llama 3.2 3B, and Llama 3.1 8B by DVTS time

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load CSV files
llama1b_csv = "llama_3_2_1b.csv"  
llama3b_csv = "llama_3_2_3b.csv"  
llama8b_csv = "llama_3_1_8b.csv"

df_1b = pd.read_csv(llama1b_csv)
df_3b = pd.read_csv(llama3b_csv)
df_8b = pd.read_csv(llama8b_csv)

# Extract relevant data
x_1b = df_1b["Number of generations"]
time_1b = df_1b["Diverse verifier tree search Time (s)"]

x_3b = df_3b["Number of generations"]
time_3b = df_3b["Diverse verifier tree search Time (s)"]

x_8b = df_8b["Number of generations"]
time_8b = df_8b["Diverse verifier tree search Time (s)"]

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(x_1b, time_1b, marker='o', label="Llama 3.2 1b", linestyle='-')
plt.plot(x_3b, time_3b, marker='s', label="Llama 3.2 3b", linestyle='--')
# Plot the 8B data (not the 3B data again); ':' is a valid Matplotlib linestyle, '---' is not
plt.plot(x_8b, time_8b, marker='x', label="Llama 3.1 8b", linestyle=':')

# Titles and Labels
plt.suptitle("Comparison of Diverse Verifier Tree Search Time", fontsize=14)
plt.title("Llama 3.2 1b vs. Llama 3.2 3b vs Llama 3.1 8b", fontsize=10, color='gray')
plt.xlabel("Number of Generations")
plt.ylabel("Diverse Verifier Tree Search Time (s)")
plt.legend()

# Ensure integer values on x-axis
plt.xticks(x_1b, labels=[str(int(val)) for val in x_1b])  

# Grid for better readability
plt.grid(True, linestyle="--", linewidth=0.5)

# Show the plot
plt.show()


Compare Llama 3.2 1B, Llama 3.2 3B, and Llama 3.1 8B by the number of DVTS tokens

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load CSV files
llama1b_csv = "llama_3_2_1b.csv"  
llama3b_csv = "llama_3_2_3b.csv"  
llama8b_csv = "llama_3_1_8b.csv"

df_1b = pd.read_csv(llama1b_csv)
df_3b = pd.read_csv(llama3b_csv)
df_8b = pd.read_csv(llama8b_csv)

# Extract relevant data
x_1b = df_1b["Number of generations"]
tokens_1b = df_1b["Diverse verifier tree search Total tokens"].astype(int)

x_3b = df_3b["Number of generations"]
tokens_3b = df_3b["Diverse verifier tree search Total tokens"].astype(int)

x_8b = df_8b["Number of generations"]
tokens_8b = df_8b["Diverse verifier tree search Total tokens"].astype(int)

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(x_1b, tokens_1b, marker='o', label="Llama 1b", linestyle='-')
plt.plot(x_3b, tokens_3b, marker='s', label="Llama 3b", linestyle='--')
plt.plot(x_8b, tokens_8b, marker='x', label="Llama 8b", linestyle='--')

# Titles and Labels
plt.suptitle("Comparison of Total Tokens in Diverse Verifier Tree Search", fontsize=14)
plt.title("Llama 3.2 1b vs. Llama 3.2 3b vs Llama 3.1 8b", fontsize=10, color='gray')
plt.xlabel("Number of Generations")
plt.ylabel("Total Tokens")
plt.legend()

# Ensure integer values on x-axis
plt.xticks(x_1b, labels=[str(int(val)) for val in x_1b])  

# Ensure integer values on y-axis (include all three models)
y_ticks = sorted(set(tokens_1b.tolist() + tokens_3b.tolist() + tokens_8b.tolist()))
plt.yticks(y_ticks, labels=[str(val) for val in y_ticks])

# Grid for better readability
plt.grid(True, linestyle="--", linewidth=0.5)

# Show the plot
plt.show()
