# Retrieval Augmented Generation (RAG) intrinsics in Granite 3.2

*Using IBM Granite Models*

## In this notebook

This notebook demonstrates using the experimental intrinsics for Retrieval Augmented Generation (RAG) in Granite 3.2.

RAG is an architectural pattern that can be used to augment the performance of language models by recalling factual information from a knowledge base, and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

The Granite 3.2 model's experimental intrinsics provide hallucination confidence and citation generation capabilities for RAG operations.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we embed the passages with a Granite embedding model and store the embeddings in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding retrieved passages into a large language model, along with the user query.
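
The three steps above can be sketched without any external services. The example below is only a toy illustration: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the passages are made up, and the "generation" step stops at building the prompt that would be sent to a language model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Step 1 (initial setup): index the knowledge-base passages.
passages = [
    "Milvus stores embedding vectors for similarity search",
    "Docling converts web pages into text chunks",
    "Granite generates an answer from the retrieved passages",
]
index = [(passage, embed(passage)) for passage in passages]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: combine the retrieved passages and the query into one prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


question = "which component stores the vectors"
top = retrieve(question)
prompt = build_prompt(question, top)
print(prompt)
```

The rest of this notebook replaces each toy piece with a real component: a Granite embedding model, a Milvus vector store, and a Granite language model.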

NOTE: In Granite 3.2, the hallucination confidence and citation generation capabilities are currently considered experimental.

## Setting up the environment

### Install dependencies

In [None]:
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    "transformers" \
    "accelerate" \
    "langchain_community" \
    "langchain_huggingface" \
    "langchain-milvus" \
    "replicate" \
    "docling"

## Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text. Here we use one of the new [Granite Embeddings models](https://huggingface.co/collections/ibm-granite/granite-embedding-models-6750b30c802c1926a35550bb).

To use a model from another provider, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
    model_name=embeddings_model_path,
)

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [None]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

## Use the Granite model

Select a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. Here we use the Replicate LangChain client to connect to the model.

To get set up with Replicate, see [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb).

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [None]:
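from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var
# Imported here as well, so this cell still runs if the embeddings cell is replaced.
from transformers import AutoTokenizer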

model_path = "ibm-granite/granite-3.2-8b-instruct"
model = Replicate(
    model=model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
    },
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

## Building the Vector Database

In this example, we use [Docling](https://docling-project.github.io/docling/) to convert a set of source documents into text and split the text into chunks. We then derive embedding vectors for the chunks using the embedding model and load them into the vector database for querying.

### Use Docling to download the documents, convert to text, and split into chunks


Here we use a set of web pages about IBM and the US Open. For each source web page, we convert the page into a DoclingDocument and then chunk the DoclingDocument. Finally, we create LangChain Documents for all the chunks labeled text or paragraph. Each Document is annotated with metadata that defines a unique document id and the source of the document.

In [None]:
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document

sources = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]

converter = DocumentConverter()
i = 0
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (i:=i+1), "source": source})
    for source in sources
    for chunk in HierarchicalChunker().chunk(converter.convert(source=source).document)
    if any(filter(lambda c: c.label in [DocItemLabel.TEXT, DocItemLabel.PARAGRAPH], iter(chunk.meta.doc_items)))
]

print(f"{len(texts)} documents created")

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")

## Querying the Vector Database

We define the query to use for the RAG operation.

In [None]:
query = "How did IBM use watsonx at the 2024 US Open Tennis Championship?"

### Conduct a similarity search

To demonstrate the similarity search used during the RAG operation, search the database for the documents whose embedding vectors lie closest to the embedded query in vector space.

In [None]:
docs = vector_db.similarity_search(query)
print(f"{len(docs)} documents returned")
for d in docs:
    print(f"doc_id={d.metadata['doc_id']}: {d.page_content}")

## Answering Questions

### Create the prompt for Granite

We define the prompts for Granite to perform the RAG query. We use the Granite chat template and supply the placeholder values that the LangChain RAG pipeline will replace. We also use Granite chat template controls to enable the hallucination confidence and citation capabilities.

In [None]:
from langchain.prompts import PromptTemplate
from string import Formatter

# controls for enabling Granite intrinsic capability for hallucination confidence and citations
controls = {
    "hallucinations": True,
    "citations": True,
}

# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": "{input}",
      }],
    documents=[{
        "title": "placeholder",
        "text": "{context}",
    }],
    controls=controls,
    add_generation_prompt=True,
    tokenize=False,
)

# The Granite prompt can contain JSON strings, so escape their literal braces
# to keep PromptTemplate from treating them as placeholders
def escape_f_string(f_string: str, *keys: str) -> str:
    """Escape non-keys in the specified f-string.

    Args:
        f_string (str): The f-string to escape.
        keys: The key names which are part of the f-string and should not be escaped.

    Returns:
        str: The f-string with non-keys escaped in double braces.
    """
    result = []
    for literal_text, field_name, format_spec, conversion in Formatter().parse(f_string):
        if literal_text:
            result.append(literal_text)
        if field_name is not None:
            is_key = field_name in keys
            result.append("{" if is_key else "{{")
            result.append(field_name)
            if conversion:
                result.append("!")
                result.append(conversion)
            if format_spec:
                result.append(":")
                result.append(format_spec)
            result.append("}" if is_key else "}}")
    return "".join(result)

prompt_template = PromptTemplate.from_template(template=escape_f_string(prompt, "input", "context"))

# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
Document {doc_id}
{page_content}""")
document_separator="\n\n"
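
To see why the escaping step matters, here is a small standalone sketch using plain `str.format`, which follows the same brace rules as the default f-string style of `PromptTemplate`. The prompt text is a made-up fragment for illustration, not the actual output of the Granite chat template.

```python
# A prompt that mixes literal JSON braces with a real placeholder.
# Hypothetical fragment; the real Granite prompt is produced by the chat template.
raw_prompt = 'assistant {"citations": true}\nQuestion: {input}'

try:
    raw_prompt.format(input="What is RAG?")
    failed = False
except (KeyError, ValueError):
    # The JSON object is parsed as a replacement field with an unknown name.
    failed = True

# Doubling the braces around the JSON marks them as literal characters,
# while {input} stays a placeholder. This is the shape an escaping helper
# should produce.
escaped_prompt = 'assistant {{"citations": true}}\nQuestion: {input}'
rendered = escaped_prompt.format(input="What is RAG?")

print(f"unescaped prompt failed to format: {failed}")
print(rendered)
```

After formatting, the doubled braces collapse back to single braces, so the model receives the JSON exactly as the chat template wrote it.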

### Automate the RAG pipeline


We now build a RAG chain from the model, the document retriever, and the prompts.

In [None]:
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Assemble the retrieval-augmented generation chain.
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
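
As a rough mental model of what the combine-documents step does with the document prompt and the separator, the sketch below renders each retrieved document and joins the results into the single string that fills `{context}`. It is a simplified approximation in plain Python, not LangChain's implementation; the `Doc` class and the sample documents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Same shape as the document prompt template and separator defined earlier.
document_prompt = "Document {doc_id}\n{page_content}"
document_separator = "\n\n"


def stuff_documents(docs: list[Doc]) -> str:
    """Render each document with the document prompt and join the results."""
    return document_separator.join(
        document_prompt.format(page_content=d.page_content, **d.metadata)
        for d in docs
    )


docs = [
    Doc("First retrieved chunk.", {"doc_id": 7, "source": "https://example.com/a"}),
    Doc("Second retrieved chunk.", {"doc_id": 12, "source": "https://example.com/b"}),
]
context = stuff_documents(docs)
print(context)
```

The resulting string is what replaces `{context}` in the Granite prompt, which is why each chunk carries its `doc_id` into the text the model sees.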

### Generate a retrieval-augmented response to a question

Use the RAG chain to process a question. The document chunks relevant to that question are retrieved and used as context. Since we enabled the Granite hallucination confidence and citation capabilities, the response includes this information alongside the answer.

In [None]:
output = rag_chain.invoke({"input": query})

print(output['answer'])
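
If you want to handle the answer, the citations, and the hallucination information separately, one option is to split the response text on section headings. The sketch below is an assumption-heavy illustration: the heading strings `# Citations:` and `# Hallucinations:` and the sample response are hypothetical, so check them against the actual output printed above and adjust `headers` as needed.

```python
def split_sections(
    response: str,
    headers: tuple[str, ...] = ("# Citations:", "# Hallucinations:"),  # assumed headings
) -> dict[str, str]:
    """Split a response into the answer and one entry per heading found."""
    # Locate each heading that is present, ordered by position in the text.
    found = sorted((response.find(h), h) for h in headers if h in response)
    end_of_answer = found[0][0] if found else len(response)
    sections = {"answer": response[:end_of_answer].strip()}
    for n, (start, header) in enumerate(found):
        stop = found[n + 1][0] if n + 1 < len(found) else len(response)
        sections[header] = response[start + len(header):stop].strip()
    return sections


# Made-up response text, used only to exercise the function.
sample = (
    "IBM used watsonx to draft match reports.<co>1</co>\n\n"
    "# Citations:\n"
    '<co>1</co> Document 3: "placeholder cited sentence"\n\n'
    "# Hallucinations:\n"
    "1. Risk low: IBM used watsonx to draft match reports."
)
sections = split_sections(sample)
for name, body in sections.items():
    print(f"[{name}]\n{body}\n")
```

With the real chain, you would call `split_sections(output['answer'])`. A response without any of the headings comes back as a single `answer` entry.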