<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic2/r.2_inference_time_compute_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials R.2. Inference-time compute

# Practice solutions

## Getting ready

In [None]:
!pip install -q openai

In [None]:
import os

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

We'll be calling APIs quite often in this notebook, so let's define a shortcut function to avoid repeating the boilerplate:

In [None]:
from openai import OpenAI

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def prettify_string(text, max_line_length=80):
    """Returns the string with line breaks inserted at spaces to prevent horizontal scrolling.

    Args:
        text: The string to wrap.
        max_line_length: The maximum length of each line.

    Returns:
        The wrapped string.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                if current_line:  # Avoid emitting an empty line before an over-long word
                    output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_8b_model,
                    prettify=True,
                    temperature=None) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content

# Practice, Part 4: Confidence as a Synthetic PRM

Training **Process Reward Models (PRMs)** is challenging, and only a few such models are available on Hugging Face, none of which are ideal. A **model-free** method for scoring solutions would therefore be useful. One simple surrogate for a process reward is **confidence**.

In the [LLM Inference Parameters notebook](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic1/1.6_llm_inference_parameters.ipynb), we discussed that LLMs exhibit varying levels of confidence in their generated outputs:

<center>
<img src="https://drive.google.com/uc?export=view&id=12k5EFzMZAcHntuJZBZwbm6NKqJZ1OF3l" width=600 />
</center>

The left image illustrates a case where the LLM is almost certain to generate "LLM," while the right image shows a scenario where the model is less confident in its output. While uncertainty can be valuable in creative writing, it may indicate confusion - or even hallucinations - in mathematical problem-solving. Thus, for math and logical reasoning tasks, it is reasonable to assume that **solutions generated with higher confidence are more likely to be correct**.

### Simple approach: using top predicted probability

With this in mind, we suggest modifying the **Beam Search** algorithm to evaluate partial solutions based on their **mean confidence**. Confidence can be estimated using the **mean top predicted log probability**, calculated as:

$$\frac{1}{\mathrm{n\_steps}}\sum_{i=1}^{\mathrm{n\_steps}}\log\left(\mbox{Top token probability predicted at step $i$}\right)$$

The top probability can be obtained by calling `client.chat.completions.create` with `logprobs=True` (optionally with `top_logprobs=k`) and reading the per-token entries in `completion.choices[0].logprobs.content`.

Although this approach is fairly simplistic, it may still be effective. A higher top probability implies lower probabilities for alternative tokens, indicating greater confidence in the top prediction.
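
As a minimal sketch of the formula above, the helper below averages the top log probability over token positions. It assumes each position is an object with a `top_logprobs` list whose items carry a `logprob` attribute, which is the shape the `openai` Python client exposes in `completion.choices[0].logprobs.content`. The name `mean_top_logprob` and the mock values are hypothetical and only illustrate the computation.

```python
import math
from types import SimpleNamespace


def mean_top_logprob(token_positions) -> float:
    """Average, over positions, of the largest log probability reported at each position."""
    top_values = [
        max(alt.logprob for alt in position.top_logprobs)
        for position in token_positions
        if position.top_logprobs
    ]
    if not top_values:
        raise ValueError("no log probabilities available")
    return sum(top_values) / len(top_values)


def _position(*probs):
    # Mock of one token position; stands in for an entry of logprobs.content
    return SimpleNamespace(
        top_logprobs=[SimpleNamespace(logprob=math.log(p)) for p in probs]
    )


# Illustrative values only, not real model output
confident = [_position(0.9, 0.05), _position(0.8, 0.1)]
hesitant = [_position(0.4, 0.3), _position(0.5, 0.45)]

# The confidently generated step gets the score closer to 0
print(mean_top_logprob(confident) > mean_top_logprob(hesitant))
```

Scores are always non-positive, and a step generated with near-certain tokens scores close to 0.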

### A fancier approach: Negative Mean Entropy

A more robust method involves using **negative mean entropy**. [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) quantifies the uncertainty of a probability distribution. For next-token generation, it is calculated as:

$$-\sum_{w\in\mbox{Vocab}}\widehat{p}_{w}\log{ \widehat{p}_{w} },$$

where $\widehat{p}_{w}$ represents the predicted probability of token $w$. Entropy behaves as follows:

- $0$ when one token has a probability of 1 while all others have 0 (**absolute certainty**).
- Maximum when all tokens have equal probabilities (**absolute uncertainty**).

Thus, solutions with **lower entropy** are more confidently generated.
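
To make the two extreme cases concrete, here is a small sketch that evaluates the entropy formula on toy four-token distributions. The function name `entropy` and the distributions are illustrative assumptions; the natural logarithm is used, and terms with zero probability are skipped (the usual convention that $0 \log 0 = 0$).

```python
import math


def entropy(probs) -> float:
    """Shannon entropy (natural log) of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


certain = [1.0, 0.0, 0.0, 0.0]      # one token has all the mass
uniform = [0.25, 0.25, 0.25, 0.25]  # all tokens equally likely
peaked = [0.7, 0.1, 0.1, 0.1]       # somewhere in between

print(entropy(certain), entropy(peaked), entropy(uniform))
```

The certain distribution yields zero, the uniform one yields $\log 4$, and the peaked one falls between them.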

Unfortunately, OpenAI's API only provides the top-5 token probabilities, so the entropy cannot be computed exactly. It can still be estimated from these top-5 probabilities. You can try this as well, but we recommend starting with the top probability alone.
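
One possible way to estimate entropy from only the top-k log probabilities is sketched below. It rests on an assumption that is not part of the original solution: the probability mass outside the top-k is lumped into a single extra bucket, which gives a lower bound on the true entropy. The names `truncated_entropy` and `negative_mean_entropy` and the sample values are hypothetical.

```python
import math


def truncated_entropy(top_logprobs) -> float:
    """Entropy estimate from top-k log probabilities; leftover mass is treated as one bucket."""
    probs = [math.exp(lp) for lp in top_logprobs]
    residual = max(0.0, 1.0 - sum(probs))  # clamp small negative values from rounding
    buckets = probs + [residual]
    return -sum(p * math.log(p) for p in buckets if p > 0)


def negative_mean_entropy(per_position_top_logprobs) -> float:
    """Confidence score for a step: values closer to 0 mean more confident generation."""
    if not per_position_top_logprobs:
        raise ValueError("no token positions given")
    entropies = [truncated_entropy(lps) for lps in per_position_top_logprobs]
    return -sum(entropies) / len(entropies)


# Illustrative top-5 probabilities for a single token position (not real model output)
confident_step = [[math.log(p) for p in (0.96, 0.01, 0.01, 0.01, 0.005)]]
hesitant_step = [[math.log(p) for p in (0.3, 0.25, 0.2, 0.1, 0.05)]]

print(negative_mean_entropy(confident_step) > negative_mean_entropy(hesitant_step))
```

Because the score is negated, it can be plugged into a beam search in the same "higher is better" role as the mean top log probability.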

**Solution**

In [None]:
import heapq
from typing import List, Dict, Tuple, Optional
from openai import OpenAI

class LLMClient:
    """Wrapper for OpenAI-compatible API clients with consistent interface."""
    def __init__(
        self,
        client: OpenAI,
        model: str,
        default_temperature: float = 0.0,
        default_max_tokens: int = 1024,
        system_prompt: Optional[str] = None
    ):
        self.client = client
        self.model = model
        self.default_temperature = default_temperature
        self.default_max_tokens = default_max_tokens
        self.system_prompt = system_prompt

    def generate(
        self,
        prompt: str,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        system_prompt: Optional[str] = None,
        logprobs: bool = False,
        top_logprobs: Optional[int] = None
    ) -> Tuple[str, Optional[Dict]]:
        """Generate completion with consistent interface across different LLM providers.
        Returns the completion and logprobs if requested."""
        messages = []

        # Use provided system prompt or fall back to default
        current_system_prompt = system_prompt or self.system_prompt
        if current_system_prompt:
            messages.append({"role": "system", "content": current_system_prompt})

        messages.append({"role": "user", "content": prompt})

        # Add parameters for logprobs
        params = {
            "model": self.model,
            "messages": messages,
            "temperature": temperature if temperature is not None else self.default_temperature,
            "max_tokens": max_tokens if max_tokens is not None else self.default_max_tokens
        }

        # Add logprobs params if requested
        if logprobs:
            params["logprobs"] = True
            if top_logprobs is not None:
                params["top_logprobs"] = top_logprobs

        completion = self.client.chat.completions.create(**params)

        content = completion.choices[0].message.content

        # Extract logprobs if available
        logprob_data = None
        if logprobs and hasattr(completion.choices[0], "logprobs"):
            logprob_data = completion.choices[0].logprobs

        return content, logprob_data

class ConfidenceBasedBeamSearch:
    """Beam search implementation using intrinsic token probability as confidence measure."""
    def __init__(
        self,
        llm_client: LLMClient,
        beam_width: int = 2,
        max_steps: int = 10
    ):
        self.llm_client = llm_client
        self.beam_width = beam_width
        self.max_steps = max_steps

    def generate_next_steps_with_confidence(
        self,
        prompt: str,
        partial_solution: str,
        num_continuations: int
    ) -> List[Tuple[str, float]]:
        """Generate next possible steps using LLM and return with confidence scores."""
        if partial_solution:
            message = f"""You are an expert math problem solver. Given a math problem and a partial solution, generate the next logical step.
Keep the step concise and focused on one specific calculation or logical deduction.

Problem:
{prompt}

Current partial solution:
{partial_solution}

Generate the next step in the solution. Only output:
- the next step
- #ANSWER: followed by your answer, if you can determine the final answer. If you output #ANSWER:, you need to output the actual answer after it.
Don't output anything else!
Keep the whole new step on a single line.
If you need to write formulas, use latex $ markup to format them."""
        else:
            message = f"""You are an expert math problem solver. Given a math problem, generate the first step of the solution.
Keep the step concise and focused on one specific calculation or logical deduction.

Problem:
{prompt}

Generate the first step in the solution. Only output:
- the first step,
- #ANSWER: followed by your answer, if you can determine the final answer. If you output #ANSWER:, you need to output the actual answer after it.
Don't output anything else!
Keep the whole first step on a single line.
If you need to write formulas, use latex $ markup to format them."""

        responses_with_confidence = []
        for _ in range(num_continuations):
            # Request generation with logprobs to get confidence data
            response, logprob_data = self.llm_client.generate(
                message,
                logprobs=True,
                top_logprobs=1   # Get logprobs for the top token at each position
            )

            # Calculate confidence score as the mean of the per-token log probabilities
            # Higher value is better (closer to 0 as logprobs are negative)
            confidence_score = self.calculate_confidence_score(logprob_data)

            responses_with_confidence.append((response.strip(), confidence_score))

        return responses_with_confidence

    def calculate_confidence_score(self, logprob_data) -> float:
        """Calculate confidence score from logprob data.
        Uses the mean log probability of the most likely token at each position.
        """
        content = getattr(logprob_data, "content", None)
        if not content:
            return -10.0  # Default low confidence if data not available

        # Extract the top log probability at each token position
        token_logprobs = []
        for token_info in content:
            top_logprobs = getattr(token_info, "top_logprobs", None)
            if top_logprobs:
                # top_logprobs is a list of alternatives, each with a .logprob attribute
                token_logprobs.append(max(alt.logprob for alt in top_logprobs))
            elif getattr(token_info, "logprob", None) is not None:
                # Fall back to the log probability of the sampled token
                token_logprobs.append(token_info.logprob)

        # Calculate average logprob if we have data
        if token_logprobs:
            return sum(token_logprobs) / len(token_logprobs)
        return -10.0  # Default low confidence

    def beam_search(self, prompt: str, verbose: bool = False) -> List[Tuple[float, str]]:
        """Perform beam search to find the best solution using intrinsic confidence."""
        # Beam entries are tuples of (confidence_score, solution, is_finalized).
        # Get initial steps with confidence scores (an empty partial solution selects the first-step prompt)
        initial_continuations = self.generate_next_steps_with_confidence(prompt, "", self.beam_width)

        # Initialize beam with scored initial steps
        candidates = []
        for continuation, confidence_score in initial_continuations:
            is_finalized = "#ANSWER:" in continuation
            candidates.append((-confidence_score, continuation, is_finalized))  # Negative for max-heap
            if verbose:
                print(f"\nInitial step (confidence {confidence_score:.4f}):")
                print(f"{continuation}\n")

        # Select top-k candidates for initial beam
        heapq.heapify(candidates)
        current_beam = [(-confidence, solution, is_finalized)
                       for confidence, solution, is_finalized in heapq.nsmallest(self.beam_width, candidates)]

        # Beam search iterations
        step = 0
        while step < self.max_steps:
            # Check if all solutions are finalized
            if all(is_finalized for _, _, is_finalized in current_beam):
                break

            if verbose:
                print(f"\n=== Step {step + 1} ===")
            candidates = []

            # Keep finalized solutions and generate continuations for unfinished ones
            for confidence, partial_solution, is_finalized in current_beam:
                if is_finalized:
                    # Keep finalized solutions in candidates without modification
                    candidates.append((-confidence, partial_solution, True))
                else:
                    # Generate continuations only for unfinished solutions
                    continuations = self.generate_next_steps_with_confidence(
                        prompt,
                        partial_solution,
                        self.beam_width
                    )

                    # Evaluate each continuation
                    for continuation, new_confidence in continuations:
                        new_solution = partial_solution + "\n\n" + continuation if partial_solution else continuation
                        is_finished = "#ANSWER:" in continuation
                        candidates.append((-new_confidence, new_solution, is_finished))
                        if verbose:
                            print(f"\nCandidate (confidence {new_confidence:.4f}):")
                            print(f"{continuation}\n")

            # Select top-k candidates for next beam
            heapq.heapify(candidates)
            current_beam = [(-confidence, solution, is_finalized)
                          for confidence, solution, is_finalized in heapq.nsmallest(self.beam_width, candidates)]

            if verbose:
                print("\nSelected for next beam:")
                for confidence, solution, is_finalized in current_beam:
                    status = "FINALIZED" if is_finalized else "IN PROGRESS"
                    print(f"\nConfidence: {confidence:.4f} [{status}]")
                    print(f"{solution}\n")

            step += 1

        # Return the solutions in the final beam; finalized ones survive only if their score kept them there
        return [(confidence, solution) for confidence, solution, _ in current_beam]

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llm_client = LLMClient(
        client=client,
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        default_temperature=1,
        default_max_tokens=8192
    )

beam_search = ConfidenceBasedBeamSearch(
        llm_client=llm_client,
        beam_width=2,
        max_steps=20
    )

# problem = "If x^2 + y^2 = 25 and x + y = 7, find the value of x - y."
problem = "Inside a circle, two parallel chords are 6 units apart. One chord has length 14 and the other has length 10. Find the radius of the circle."

results = beam_search.beam_search(problem, verbose=True)


Initial step (confidence -0.7402):
Draw a diagram and draw a radius perpendicular to each chord, which intersects the center of the circle, creating a right triangle with the radius, half the difference of the chords (2), and half the length of one of the chords (7 or 5).


Initial step (confidence -0.4629):
Draw a perpendicular line from the center of the circle to the midpoint of each chord, and denote the length of this line by $h$, the radius of the circle by $r$, and the distance from the center of the circle to the midpoint of the longer chord by $x$, where $x = r - h$.


=== Step 1 ===

Candidate (confidence -0.2106):
Using the Pythagorean theorem, we can write the equation for the longer chord as $x^2 + 7^2 = r^2$ and for the shorter chord as $(x+6)^2 + 5^2 = r^2$.


Candidate (confidence -0.2514):
Using the Pythagorean Theorem, we can relate $r$, $x$, and $h$ to the half-lengths of the chords: $(r-h)^2 + 7^2 = r^2$ and $(r-h+6)^2 + 5^2 = r^2$.


Candidate (confidence -0.2037)