# Demonstration of the Granite certainty intrinsic

This notebook shows how to use the IO processor for the Granite certainty intrinsic,
also known as the [Granite 3.2 8B Instruct Uncertainty LoRA](https://huggingface.co/ibm-granite/granite-uncertainty-3.2-8b-lora).

To run this notebook, you will need to host both Granite 3.2 8B Instruct and the
Granite 3.2 8B Instruct Uncertainty LoRA on your own machine. The constants below
assume you started a local vLLM server with the command:
```
vllm serve ibm-granite/granite-3.2-8b-instruct \
    --enable-lora \
    --max_lora_rank 64 \
    --lora-modules ibm-granite/granite-uncertainty-3.2-8b-lora=ibm-granite/granite-uncertainty-3.2-8b-lora \
    --port 11434 \
    --gpu-memory-utilization 0.5 \
    --max-model-len 8192
```

Update the constants below to reflect how you are hosting the model.
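
Before running the remaining cells, it can help to confirm that the server exposes
both the base model and the LoRA adapter. The sketch below is not part of
`granite_io`; it assumes your server implements the OpenAI-compatible
`GET /v1/models` endpoint (as vLLM's server does) and returns the usual
`{"data": [{"id": ...}, ...]}` body. The helper names `served_model_ids`,
`missing_models`, and `fetch_models_payload` are hypothetical.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_payload: dict) -> set:
    """Collect the model ids from an OpenAI-style ``/v1/models`` response body."""
    return {entry["id"] for entry in models_payload.get("data", [])}


def missing_models(models_payload: dict, required: list) -> list:
    """Return the required model names that the server does not list."""
    served = served_model_ids(models_payload)
    return [name for name in required if name not in served]


def fetch_models_payload(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode ``<base_url>/models`` (assumes an OpenAI-compatible server)."""
    with urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as response:
        return json.load(response)
```

Once the constants are defined, calling
`missing_models(fetch_models_payload(openai_base_url), [openai_base_model_name, openai_lora_model_name])`
should return an empty list if both models are being served.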

In [None]:
# Imports go here
from granite_io.io.granite_3_2.input_processors.granite_3_2_input_processor import (
    _Granite3Point2Inputs,
)
from granite_io import make_io_processor, make_backend
from granite_io.io.certainty import CertaintyIOProcessor

In [None]:
# Constants go here
base_model_name = "ibm-granite/granite-3.2-8b-instruct"
lora_model_name = "ibm-granite/granite-uncertainty-3.2-8b-lora"

# You will need to set the following variables to appropriate values for your own
# OpenAI-compatible inference server:
openai_base_url = "http://localhost:11434/v1"
openai_base_model_name = "ibm-granite/granite-3.2-8b-instruct"
openai_lora_model_name = "ibm-granite/granite-uncertainty-3.2-8b-lora"

In [None]:
backend = make_backend(
    "openai",
    {
        "model_name": openai_base_model_name,
        "openai_base_url": openai_base_url,
    },
)
lora_backend = make_backend(
    "openai",
    {
        "model_name": openai_lora_model_name,
        "openai_base_url": openai_base_url,
    },
)

In [None]:
# Create an example chat completion with a user question and two documents.
chat_input = _Granite3Point2Inputs.model_validate(
    {
        "messages": [
            {"role": "assistant", "content": "Welcome to pet questions!"},
            {"role": "user", "content": "Which of my pets have fleas?"},
        ],
        "documents": [
            {"text": "My dog has fleas."},
            {"text": "My cat does not have fleas."},
        ],
        "generate_inputs": {
            "temperature": 0.0,
            "max_tokens": 4096,
        },
    }
)
chat_input

In [None]:
# Pass the example input through Granite 3.2 to get an answer
granite_io_proc = make_io_processor("Granite 3.2", backend=backend)
result = await granite_io_proc.acreate_chat_completion(chat_input)
result.results[0].next_message

In [None]:
# Append the model's output to the chat
next_chat_input = chat_input.with_next_message(result.results[0].next_message)
next_chat_input.messages

In [None]:
# Instantiate the I/O processor for the certainty intrinsic
io_proc = CertaintyIOProcessor(lora_backend)

# Set temperature to 0 because we are not sampling from the intrinsic's output
next_chat_input = next_chat_input.with_addl_generate_params({"temperature": 0.0})

# Pass our example input through the I/O processor and retrieve the result
chat_result = await io_proc.acreate_chat_completion(next_chat_input)

print(
    f"Certainty score for this response is "
    f"{chat_result.results[0].next_message.content}"
)

In [None]:
# Try with an artificial, poor-quality assistant response.
from granite_io.types import AssistantMessage

chat_result_2 = await io_proc.acreate_chat_completion(
    chat_input.with_next_message(
        AssistantMessage(content="Your iguana is absolutely covered in fleas.")
    ).with_addl_generate_params({"temperature": 0.0})
)
print(
    f"Certainty score for this response is "
    f"{chat_result_2.results[0].next_message.content}"
)
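
The two cells above print the score as text. To act on it in code, a small helper
along these lines may be useful. This is only a sketch: it assumes that
`next_message.content` is a string that parses as a number, with higher values
meaning higher certainty. Inspect the output you actually get before relying on
that. The function names and the `0.5` threshold are hypothetical.

```python
from typing import Optional


def parse_certainty(content: str) -> Optional[float]:
    """Parse the message content as a float; return None if it is not numeric."""
    try:
        return float(content.strip())
    except (AttributeError, ValueError):
        return None


def is_low_certainty(content: str, threshold: float = 0.5) -> bool:
    """Treat unparseable scores and scores below the (assumed) threshold as low."""
    score = parse_certainty(content)
    return score is None or score < threshold
```

For example, `is_low_certainty(chat_result_2.results[0].next_message.content)` would
flag the iguana response if its score falls below the chosen threshold.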