# Introduction to granite-io and the Granite 3.3 RAG Agent Library

This notebook provides a high-level introduction to the `granite-io` library and to
the [Granite 3.3 RAG Agent Library](
    https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib).


This notebook can run its own vLLM server to perform inference, or you can host the 
models on your own server. 

To use your own server, set the `run_server` variable below
to `False` and set appropriate values for the constants in the cell marked
`# Constants go here`.

Other notebooks in this directory provide a more in-depth treatment of concepts covered
in this notebook:

* Advanced end-to-end Retrieval Augmented Generation flows: [rag.ipynb](./rag.ipynb)
* `granite-io` library: [io.ipynb](./io.ipynb)
* LoRA Adapter for Answerability Classification: [answerability.ipynb](./answerability.ipynb)
* Granite 3.3 8b Instruct - Uncertainty LoRA: [certainty.ipynb](./certainty.ipynb)
* LoRA Adapter for Citation Generation: [citations.ipynb](./citations.ipynb)
* LoRA Adapter for Query Rewrite: [query_rewrite.ipynb](./query_rewrite.ipynb)
* Retrieval IO processor: [retrieval.ipynb](./retrieval.ipynb)

In [None]:
import pathlib
from granite_io.io.granite_3_3.input_processors.granite_3_3_input_processor import (
    Granite3Point3Inputs,
)
from granite_io import make_io_processor, make_backend
from granite_io.io.base import RewriteRequestProcessor
from granite_io.io.retrieval.util import download_mtrag_embeddings
from granite_io.io.retrieval import (
    Retriever,
    InMemoryRetriever,
    RetrievalRequestProcessor,
)
from granite_io.io.answerability import (
    AnswerabilityIOProcessor,
)
from granite_io.io.query_rewrite import QueryRewriteIOProcessor
from granite_io.io.citations import CitationsIOProcessor, CitationsCompositeIOProcessor
from granite_io.io.hallucinations import HallucinationsIOProcessor
from granite_io.backend.vllm_server import LocalVLLMServer
from granite_io.io.certainty import CertaintyIOProcessor
from granite_io.types import GenerateInputs
from granite_io.visualization import CitationsWidget
from IPython.display import display, Markdown
from granite_io.io.rag_agent_lib import obtain_loras
import pandas as pd
import os

In [None]:
# Constants go here
temp_data_dir = "../data/test_retrieval_temp"
corpus_name = "govt"
embeddings_data_file = pathlib.Path(temp_data_dir) / f"{corpus_name}_embeds.parquet"
embedding_model_name = "multi-qa-mpnet-base-dot-v1"
model_name = "ibm-granite/granite-3.3-8b-instruct"

query_rewrite_lora_name = "query_rewrite"
citations_lora_name = "citation_generation"
answerability_lora_name = "answerability_prediction"
hallucination_lora_name = "hallucination_detection"
certainty_lora_name = "certainty"
all_lora_names = [
    query_rewrite_lora_name,
    citations_lora_name,
    answerability_lora_name,
    hallucination_lora_name,
    certainty_lora_name,
]

# Download the indexed corpus if it hasn't already been downloaded.
# This notebook uses a subset of the government corpus from the MTRAG benchmark.
embeddings_location = f"{temp_data_dir}/{corpus_name}_embeds.parquet"
if not os.path.exists(embeddings_location):
    download_mtrag_embeddings(embedding_model_name, corpus_name, embeddings_location)

run_server = True

## granite-io

The `granite-io` library provides input and output processing for large language models.
In this context, *input and output processing* refers to the steps that happen 
immediately before and after low-level model inference. These steps include:

* **Input processing:** Translating application data structures such as messages and 
  documents into a string prompt for a particular model
* **Output processing:** Parsing the raw string output of a language model into 
  structured application data
* **Constrained decoding:** Constraining the raw string output of an LLM to ensure that
  the model's output will always parse into structured application data
* **Inference-time scaling:** Extracting a higher-quality answer from an LLM by 
  combining the results of multiple inference calls.
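
As a minimal sketch of the first two steps, the following self-contained example renders a list of messages into a prompt string and parses a raw completion back into a structured message. The `<|role|>` markers and the `Message`, `render_prompt`, and `parse_output` names are hypothetical, made up for illustration; they are not the Granite chat template or the `granite-io` API.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def render_prompt(messages: list[Message]) -> str:
    """Input processing: flatten structured messages into one prompt string."""
    # Hypothetical template; each real model defines its own format.
    turns = [f"<|{m.role}|>\n{m.content}" for m in messages]
    # End with an open assistant turn so the model continues from there.
    return "\n".join(turns) + "\n<|assistant|>\n"


def parse_output(raw: str) -> Message:
    """Output processing: turn the raw completion into a structured message."""
    # Drop anything the model emitted after starting another turn.
    text = raw.split("<|", 1)[0].strip()
    return Message(role="assistant", content=text)


prompt = render_prompt([Message("user", "Do I need a permit for a fence?")])
reply = parse_output("  It depends on the fence height.\n<|user|>")
```

An InputOutputProcessor bundles both directions for one model, so application code only ever sees structured messages.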


`granite-io` includes three main types of entry points:
* **Backend connectors** connect the `granite-io` library to different model inference 
  engines and vector databases.
  The other components of `granite-io` use these connectors to invoke model inference with
  exactly the right low-level parameters for each model and inference layer.
* **InputOutputProcessors** provide input and output processing for specific models.
  An InputOutputProcessor exposes a "chat completions" interface, where the input is the
  structured representation of a conversation and the output is the next turn of the
  conversation.
  For some models, such as [IBM Granite 3.3](https://huggingface.co/collections/ibm-granite/granite-33-language-models-67f65d0cca24bcbd1d3a08e3), we also provide
  separate APIs that only perform input processing or output processing.
* **RequestProcessors** rewrite chat completion requests in various ways, such as 
  rewording messages, attaching RAG documents, or filtering documents. You can chain
  one or more RequestProcessors with an InputOutputProcessor to implement a custom 
  inference workflow.
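
The chaining pattern in the last bullet can be sketched without any library code. In this sketch a request is a plain dict, and `reword_last_turn`, `attach_documents`, `echo_io_processor`, and `run_chain` are hypothetical stand-ins rather than `granite-io` classes; the library's own `RequestProcessor` and `InputOutputProcessor` types appear later in this notebook.

```python
def reword_last_turn(request: dict) -> dict:
    """Hypothetical request processor: make the last message self-contained."""
    messages = list(request["messages"])  # Copy so the caller's request is untouched
    messages[-1] = messages[-1].replace("one", "a fence")
    return {**request, "messages": messages}


def attach_documents(request: dict) -> dict:
    """Hypothetical request processor: attach placeholder documents."""
    last_turn = request["messages"][-1]
    return {**request, "documents": [f"placeholder document for: {last_turn}"]}


def echo_io_processor(request: dict) -> str:
    """Hypothetical IO processor: reports what it would send to a model."""
    num_messages = len(request["messages"])
    num_documents = len(request.get("documents", []))
    return f"{num_messages} messages, {num_documents} documents"


def run_chain(request, request_processors, io_processor):
    # Apply each request processor in order, then hand off to the IO processor.
    for process in request_processors:
        request = process(request)
    return io_processor(request)


original = {"messages": ["Can I add one?"]}
result = run_chain(original, [reword_last_turn, attach_documents], echo_io_processor)
```

Each processor returns a new request instead of mutating its input, which keeps the original request available for later steps.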

## Backends

All the parts of `granite-io` that we exercise in this notebook rely on the Backend 
API, so we start by instantiating a Backend instance for each of the models that
this notebook uses.

In [None]:
if run_server:
    # Download and cache the LoRA adapters, then start a local vLLM server and
    # connect a backend instance to each adapter and to the base model.
    lora_model_paths = obtain_loras(all_lora_names)
    server = LocalVLLMServer(
        model_name,
        lora_adapters=lora_model_paths,
    )
    server.wait_for_startup(200)
    query_rewrite_lora_backend = server.make_lora_backend(query_rewrite_lora_name)
    citations_lora_backend = server.make_lora_backend(citations_lora_name)
    answerability_lora_backend = server.make_lora_backend(answerability_lora_name)
    hallucination_lora_backend = server.make_lora_backend(hallucination_lora_name)
    certainty_lora_backend = server.make_lora_backend(certainty_lora_name)
    backend = server.make_backend()
else:  # if not run_server
    # Use an existing server.
    # The constants here are for the server that local_vllm_server.ipynb starts.
    # Modify as needed.
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "granite_intrinsics_1234"
    backend = make_backend(
        "openai",
        {
            "model_name": model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    query_rewrite_lora_backend = make_backend(
        "openai",
        {
            "model_name": query_rewrite_lora_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    citations_lora_backend = make_backend(
        "openai",
        {
            "model_name": citations_lora_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    answerability_lora_backend = make_backend(
        "openai",
        {
            "model_name": answerability_lora_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    hallucination_lora_backend = make_backend(
        "openai",
        {
            "model_name": hallucination_lora_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    certainty_lora_backend = make_backend(
        "openai",
        {
            "model_name": certainty_lora_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )

The Backend API in `granite-io` runs low-level inference on the target
model, passing in raw string prompts and inference parameters and receiving back raw 
string results:

In [None]:
generate_result = await backend.generate(
    GenerateInputs(
        prompt="Complete this sequence: 2, 3, 5, 7, 11, 13, ",
        model=model_name,
        temperature=0.0,
        max_tokens=12,
    )
)
print(generate_result.model_dump_json(indent=2))

Most users don't interact with the low-level backend API directly. The recommended way
to use `granite-io` is via the InputOutputProcessor APIs, which convert a high-level 
request into the specific combination of inference parameters that the model needs,
run inference, and then convert the model's raw output into something that an 
application can use directly.

Let's create an example chat completion request so we can show how the high-level 
InputOutputProcessor API works.

In [None]:
chat_input = Granite3Point3Inputs.model_validate(
    {
        "messages": [
            {
                "role": "assistant",
                "content": "Welcome to the City of Dublin, CA help desk.",
            },
            {
                "role": "user",
                "content": "Hi there. Can you answer questions about fences?",
            },
            {
                "role": "assistant",
                "content": "Absolutely, I can provide general information about "
                "fences in Dublin, CA.",
            },
            {
                "role": "user",
                "content": "Great. I want to add one in my front yard. Do I need a "
                "permit?",
            },
        ],
        "generate_inputs": {
            "temperature": 0.0,
            "max_tokens": 4096,
        },
    }
)


def print_chat(c):
    display(
        Markdown(
            "\n".join([f"**{m.role.capitalize()}:** {m.content}\n" for m in c.messages])
        )
    )


print_chat(chat_input)

This chat completion request models a scenario where the user is talking to the 
automated help desk for the City of Dublin, CA and has just asked a question about 
permitting for installing fences. Running this chat completion request should produce
an assistant response to this question.

If we pass our chat completion request (`chat_input`) to a `granite-io` InputOutputProcessor's 
`create_chat_completion()` method, the InputOutputProcessor will create a string prompt
for the model, set up model-specific generation parameters, invoke model inference, and
parse the model's raw output into a structured message.

Here we create an InputOutputProcessor for the [IBM Granite 3.3](
    https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) model and point that InputOutputProcessor at the backend we used previously.

In [None]:
io_proc = make_io_processor(model_name, backend=backend)
# Use the IO processor to generate a chat completion
non_rag_result = io_proc.create_chat_completion(chat_input)
display(Markdown(non_rag_result.results[0].next_message.content))

The model's response here is generic and vague, because the model's training data does 
not cover obscure zoning ordinances of small cities in northern California.

We can use the 
[Granite 3.3 8b Instruct - Uncertainty LoRA](
    https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib/blob/main/certainty_lora/README.md)
model to flag cases such as this one that are not covered by the base model's 
training data. 

This model comes packaged as a LoRA adapter on top of Granite 3.3. To run the model, we
create an instance of `CertaintyIOProcessor` -- the `granite-io` InputOutputProcessor
for this model -- and point this InputOutputProcessor at a Backend that we have
connected to the model's LoRA adapter. Then we can pass the same chat completion request
into the model to compute a certainty score from 0 to 1.0.

In [None]:
certainty_io_proc = CertaintyIOProcessor(certainty_lora_backend)
certainty_score = (
    certainty_io_proc.create_chat_completion(chat_input).results[0].next_message.content
)
print(f"Certainty score is {certainty_score} out of 1.0")

The low certainty score indicates that the model's training data does not align closely
with this question.
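
One way an application might act on this score is to route low-certainty requests to a retrieval step. The sketch below assumes the message content is text that parses as a float; the `needs_retrieval` helper and the 0.6 threshold are hypothetical choices for illustration, not values recommended by the model's documentation.

```python
def needs_retrieval(certainty_text: str, threshold: float = 0.6) -> bool:
    """Return True when the certainty score falls below the threshold."""
    try:
        score = float(certainty_text)
    except ValueError:
        # Treat an unparseable score as "uncertain" to stay on the safe side.
        return True
    return score < threshold
```

A real application would tune the threshold against its own data.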

To answer this question properly, we need to provide the model with domain-specific 
information. One of the most popular ways to add domain-specific information to an LLM
is to use the Retrieval-Augmented Generation (RAG) pattern. RAG involves retrieving
snippets of text from a collection of documents and adding those snippets to the model's
prompt.
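
The retrieval half of RAG can be sketched as a nearest-neighbor search over embedding vectors. The example below uses tiny hand-written vectors and dot-product scoring purely for illustration; the snippets, the vectors, and the `top_k_snippets` helper are hypothetical, and a real system would compute embeddings with a model such as the one named in this notebook's constants.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_snippets(query_vec, corpus, k=2):
    """Return the k snippets whose vectors score highest against the query."""
    ranked = sorted(corpus, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical corpus of (snippet, embedding) pairs
corpus = [
    ("Made-up snippet A about fence permits.", [0.9, 0.1, 0.0]),
    ("Made-up snippet B about library hours.", [0.0, 0.2, 0.9]),
    ("Made-up snippet C about front yard fences.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # Pretend embedding of the user's question

snippets = top_k_snippets(query_vec, corpus, k=2)

# Add the retrieved snippets to the prompt ahead of the question.
prompt = "\n".join(f"[Document] {s}" for s in snippets)
prompt += "\nQuestion: Do I need a permit?"
```

The quality of the answer then depends on how well the query text matches the relevant snippets, which is why the wording of the last user turn matters in the cells that follow.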


In this case, the relevant information can be found in the Government 
corpus of the [MTRAG multi-turn RAG benchmark](https://github.com/IBM/mt-rag-benchmark).
Similar to its connectors for inference backends, `granite-io` has adapters for 
RAG retrieval backends.

Let's spin up a connection to an in-memory vector database, using embeddings that we've 
precomputed offline from the MTRAG Government corpus.

In [None]:
retriever = InMemoryRetriever(embeddings_data_file, embedding_model_name)

`granite-io` also includes a RequestProcessor that performs the retrieval phase of
RAG. This class, called `RetrievalRequestProcessor`, takes as input a chat completion
request. The RequestProcessor uses the text of the last user turn to query a `Retriever`
instance and fetch document snippets.

In [None]:
retrieval_request_proc = RetrievalRequestProcessor(retriever, top_k=3)
chat_input_with_docs = retrieval_request_proc.process(chat_input)[0]
chat_input_with_docs.model_dump()

Unfortunately, the last user turn in this conversation is:
> **User:** Great. I want to add one in my front yard. Do I need a permit?

This text is missing key details for retrieving relevant documents: What does the 
user want to add to their front yard, and what city's municipal code applies to this
yard? As a result, the retrieved documents aren't actually relevant to the user's 
question.

The [LoRA Adapter for Answerability Classification](
    https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib/blob/main/answerability_prediction_lora/README.md)
provides a robust way to detect this kind of problem. Here's what happens if we 
run the chat completion request with irrelevant document snippets through the 
answerability model, using the
`granite-io` IO processor for the model to handle input and output:

In [None]:
# Retrieval step from before...
retrieval_request_proc = RetrievalRequestProcessor(retriever, top_k=3)
chat_input_with_docs = retrieval_request_proc.process(chat_input)[0]

# ...followed by an answerability check
answerability_proc = AnswerabilityIOProcessor(answerability_lora_backend)
answerability_proc.create_chat_completion(chat_input_with_docs).results[
    0
].next_message.content

We can use the [LoRA Adapter for Query Rewrite](
    https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib/blob/main/query_rewrite_lora/README.md) to rewrite
the last user turn into a string that is more useful for retrieving document snippets.
`granite-io` includes an InputOutputProcessor for running this model.
Here's how to use this InputOutputProcessor to apply this model to our example 
conversation:

In [None]:
rewrite_io_proc = QueryRewriteIOProcessor(query_rewrite_lora_backend)
rewrite_io_proc.create_chat_completion(chat_input).results[0].next_message.content

The query rewrite model turns the last user turn in this conversation from:
> **User:** Great. I want to add one in my front yard. Do I need a permit?

...to a version of the same question that includes vital additional context:
> **User:** Do I need a permit to add a fence in my front yard in Dublin, CA?

This more specific query should allow the retriever to fetch better document snippets.

The following code snippet uses `granite-io` APIs to rewrite the user query, then
fetch relevant document snippets.

In [None]:
# Redo initialization so this cell can run independently of previous cells
rewrite_io_proc = QueryRewriteIOProcessor(query_rewrite_lora_backend)
rewrite_request_proc = RewriteRequestProcessor(rewrite_io_proc)
retrieval_request_proc = RetrievalRequestProcessor(retriever, top_k=3)


# Rewrite the last user turn into something more suitable for retrieval.
input = rewrite_request_proc.process(chat_input)[0]

# Retrieve document snippets based on the rewritten turn and attach them to the chat
# completion request.
input = retrieval_request_proc.process(input)[0]
input = input.with_messages(chat_input.messages)  # Go back to original last user turn

input.model_dump()

Attaching relevant information causes the model to respond with a more specific and 
detailed answer. Here's the result that we get when we pass the augmented chat 
completion request to the InputOutputProcessor for Granite 3.3:

In [None]:
io_proc = make_io_processor(model_name, backend=backend)
rag_result = io_proc.create_chat_completion(input)
display(Markdown(rag_result.results[0].next_message.content))

The answer contains specific details about permits for building fences in Dublin, CA.
These facts should be grounded in documents retrieved from the corpus. We would like
to be able to prove that the model used the data from the corpus and did not 
hallucinate a fictitious building code.

We can use the [LoRA Adapter for Citation Generation](
    https://huggingface.co/ibm-granite/granite-3.2-8b-lora-rag-citation-generation
) to explain exactly how this response is grounded in the documents that the rewritten
user query retrieves. As with the other models we've shown so far, `granite-io` includes
an InputOutputProcessor for this model. We can use this InputOutputProcessor to add
citations to the assistant response from the previous cell:

In [None]:
citations_io_proc = CitationsIOProcessor(citations_lora_backend)

# Add the assistant response to the original chat completion request
input_with_next_message = input.with_next_message(rag_result.results[0].next_message)

# Augment this response with citations to the RAG document snippets
results_with_citations = citations_io_proc.create_chat_completion(
    input_with_next_message
)
CitationsWidget().show(input_with_next_message, results_with_citations)

We can also use the [LoRA Adapter for Hallucination Detection in RAG outputs](
    https://huggingface.co/ibm-granite/granite-3.3-8b-rag-agent-lib/blob/main/hallucination_detection_lora/README.md
) to check whether each sentence of the assistant response is consistent with the
information in the retrieved documents.

In [None]:
hallucinations_io_proc = HallucinationsIOProcessor(hallucination_lora_backend)
result_with_hallucinations = hallucinations_io_proc.create_chat_completion(
    input_with_next_message
).results[0]

print("Hallucination Checks:")
display(
    pd.DataFrame.from_records(
        [h.model_dump() for h in result_with_hallucinations.next_message.hallucinations]
    )
)

The `granite-io` library also allows developers to create their own custom 
InputOutputProcessors. For example, here's an InputOutputProcessor that rolls up the
rewrite, retrieval, and citations processing steps from this notebook into a single
`create_chat_completion()` call:

In [None]:
from granite_io.io.base import InputOutputProcessor
from granite_io.backend import Backend
from granite_io.io.base import ChatCompletionInputs, ChatCompletionResults


class MyRAGIOProcessor(InputOutputProcessor):
    def __init__(
        self,
        base_backend: Backend,
        base_model_name: str,
        retriever: Retriever,
        query_rewrite_lora_backend: Backend,
        citations_lora_backend: Backend,
    ):
        self.rewrite_request_proc = RewriteRequestProcessor(
            QueryRewriteIOProcessor(query_rewrite_lora_backend)
        )
        self.retrieval_request_proc = RetrievalRequestProcessor(retriever)

        # Build up a chain of two IO processors: base model -> citations
        self.io_proc_chain = CitationsCompositeIOProcessor(
            make_io_processor(base_model_name, backend=base_backend),
            citations_lora_backend,
        )

    async def acreate_chat_completion(
        self, inputs: ChatCompletionInputs
    ) -> ChatCompletionResults:
        """
        Chat completions API inherited from the ``InputOutputProcessor`` base class.

        :param inputs: Structured representation of the inputs to a chat completion
            request, possibly including additional fields that only this input-output
            processor can consume

        :returns: The next message that the model produces when fed the specified
            inputs, plus additional information about the low-level request.
        """
        original_inputs = inputs

        # Rewrite the last user turn for retrieval.
        # Use the processors stored on this instance, not notebook-level globals.
        inputs = (await self.rewrite_request_proc.aprocess(inputs))[0]

        # Retrieve documents with the rewritten last turn
        inputs = (await self.retrieval_request_proc.aprocess(inputs))[0]

        # Switch back to original version of last turn
        inputs = inputs.with_messages(original_inputs.messages)

        # Generate a response and add citations
        return await self.io_proc_chain.acreate_chat_completion(inputs)

The class above wraps the rewrite, retrieval, generation, and citation steps in a 
single class that inherits from the `InputOutputProcessor` interface in `granite-io`. 
Packaging things this way lets applications treat this multi-step flow as if it were a 
single chat completion request to a base model.

In [None]:
rag_io_proc = MyRAGIOProcessor(
    base_backend=backend,
    base_model_name=model_name,
    retriever=retriever,
    query_rewrite_lora_backend=query_rewrite_lora_backend,
    citations_lora_backend=citations_lora_backend,
)

rag_results = rag_io_proc.create_chat_completion(chat_input)
CitationsWidget().show(input_with_next_message, rag_results)

In [None]:
# Free up GPU resources
if "server" in locals():
    server.shutdown()