# Demonstration of the Granite PRM intrinsic

This notebook shows how to use the IO processor for the Granite Process Reward Model (PRM) intrinsic, also known as the Granite 3.3 8B Instruct Math PRM LoRA. We first use the PRM as a standalone evaluator to score a given response to a math question, and then use it in a best-of-N (BoN) setting to choose the best response from a set of assistant responses to a math problem.
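
To make the best-of-N idea concrete before involving any model, here is a minimal sketch of the selection step alone. The names `best_of_n` and `toy_score` are hypothetical and are not part of `granite_io`; `toy_score` is a stand-in for a real PRM call, which later cells make through `ProcessRewardModelIOProcessor`.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str], score_fn: Callable[[str], float]
) -> Tuple[str, float]:
    """Return the highest-scoring candidate together with its score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    # Score every candidate once, then keep the one with the largest score.
    scored = [(score_fn(candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_score(response: str) -> float:
    """Hypothetical scorer: rewards responses that reach the answer $10."""
    return 1.0 if "$10" in response else 0.1


candidates = [
    "12/60 = 0.5 per minute, so 0.5 x 50 = $250.",
    "12/60 = 0.2 per minute, so 0.2 x 50 = $10.",
]
best, score = best_of_n(candidates, toy_score)
print(best, score)
```

In the real setting the scorer is the PRM and the candidates are sampled from the base model, so the quality of the selection depends on how well the PRM scores track correctness.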

This notebook can run its own vLLM server to perform inference, or you can host the models on your own server.

To use your own server, set the `run_server` variable below to `False` and set appropriate values for the constants `openai_base_url`, `openai_base_model_name`, and `openai_lora_model_name`.
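
When pointing the notebook at an existing server, it can help to confirm first that the server lists both the base model and the LoRA adapter. The sketch below assumes the server exposes the OpenAI-compatible `GET /models` endpoint and that the adapter is registered under the name you pass in; `fetch_model_ids` and `missing_models` are hypothetical helper names, not part of `granite_io`.

```python
import json
import urllib.request
from typing import Iterable, List


def fetch_model_ids(base_url: str, api_key: str) -> List[str]:
    """Query {base_url}/models and return the IDs of the served models."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    # Assumes the OpenAI-style list format: {"data": [{"id": ...}, ...]}.
    return [entry["id"] for entry in payload["data"]]


def missing_models(served: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required model names that the server does not list."""
    served_set = set(served)
    return [name for name in required if name not in served_set]


# Example (needs a running server, so it is left commented out):
# served = fetch_model_ids("http://localhost:55555/v1", "granite_intrinsics_1234")
# print(missing_models(served, [
#     "ibm-granite/granite-3.3-8b-instruct",
#     "ibm-granite/granite-3.3-8b-lora-math-prm",
# ]))
```

An empty list from `missing_models` means both names are being served; anything else names the model that still needs to be loaded or renamed.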

For more details, please refer to the model card at https://huggingface.co/ibm-granite/granite-3.3-8b-lora-math-prm

In [None]:
from granite_io.backend.vllm_server import LocalVLLMServer

from granite_io.io.granite_3_3.input_processors.granite_3_3_input_processor import (
    Granite3Point3Inputs,
)
from granite_io import make_io_processor, make_backend
from granite_io.io.process_reward_model.best_of_n import (
    ProcessRewardModelIOProcessor,
    PRMBestOfNCompositeIOProcessor,
)

In [None]:
# constants
base_model_name = "ibm-granite/granite-3.3-8b-instruct"
lora_model_name = "ibm-granite/granite-3.3-8b-lora-math-prm"

run_server = True

In [None]:
if run_server:
    # Start by firing up a local vLLM server and connecting a backend instance to it.
    server = LocalVLLMServer(
        base_model_name, lora_adapters=[(lora_model_name, lora_model_name)]
    )
    server.wait_for_startup(200)
    lora_backend = server.make_lora_backend(lora_model_name)
    backend = server.make_backend()
else:  # if not run_server
    # Use an existing server.
    # Modify the constants here as needed.
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "granite_intrinsics_1234"
    openai_base_model_name = base_model_name
    openai_lora_model_name = lora_model_name
    backend = make_backend(
        "openai",
        {
            "model_name": openai_base_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    lora_backend = make_backend(
        "openai",
        {
            "model_name": openai_lora_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )

In [None]:
# Create an example chat completion with a user question
chat_input = Granite3Point3Inputs.model_validate(
    {
        "messages": [
            {
                "role": "user",
                "content": "Weng earns $12 an hour for babysitting. "
                "Yesterday, she just did 50 minutes of babysitting. "
                "How much did she earn?",
            },
        ],
        "generate_inputs": {
            "temperature": 0.0,
            "max_tokens": 4096,
        },
    }
)
chat_input

In [None]:
# Pass the example input through Granite 3.3 to get an answer
granite_io_proc = make_io_processor("Granite 3.3", backend=backend)
result = await granite_io_proc.acreate_chat_completion(chat_input)
result.results[0].next_message

In [None]:
# Append the model's output to the chat
next_chat_input = chat_input.with_next_message(result.results[0].next_message)
next_chat_input.messages

In [None]:
# Instantiate the I/O processor for the PRM intrinsic
io_proc = ProcessRewardModelIOProcessor(backend=lora_backend)

# Set temperature to 0 because we are not sampling from the intrinsic's output
next_chat_input = next_chat_input.with_addl_generate_params({"temperature": 0.0})

# Pass our example input through the I/O processor and retrieve the result
chat_result = await io_proc.acreate_chat_completion(next_chat_input)

print(
    f"PRM score for the original response is "
    f"{chat_result.results[0].next_message.content}"
)

In [None]:
# Inspect the generation inputs that the I/O processor builds from the chat input
io_proc.inputs_to_generate_inputs(next_chat_input)

In [None]:
# Try with an artificial, poor-quality assistant response.
from granite_io.types import AssistantMessage

chat_result_2 = await io_proc.acreate_chat_completion(
    chat_input.with_next_message(
        AssistantMessage(
            content="Weng earns 12/60 = 0.5 per minute. "
            "Working 50 minutes, she earned 0.5 x 50 = 250. "
            "Thus Weng earns $250 for 50 minutes of babysitting."
        )
    ).with_addl_generate_params({"temperature": 0.0})
)
print(
    f"PRM score for the low-quality response is "
    f"{chat_result_2.results[0].next_message.content}"
)

In [None]:
# Use the composite processor to generate multiple completions and select the completion
# with the highest PRM score
composite_proc = PRMBestOfNCompositeIOProcessor(
    granite_io_proc, lora_backend, include_score=True
)
composite_results, all_results = await composite_proc.acreate_chat_completion(
    chat_input.with_addl_generate_params({"n": 5, "temperature": 1.0})
)
composite_results.results

In [None]:
# Free up GPU resources
if "server" in locals():
    server.shutdown()