# Retrieval Augmented Generation (RAG) with Granite RAG 3.0 8b using Ollama

*Using IBM Granite Models*

## In this notebook

This notebook contains instructions for performing Retrieval Augmented Generation (RAG) with the Granite RAG 3.0 8b LoRA adapter, served through Ollama.

RAG is an architectural pattern that can be used to augment the performance of language models by recalling factual information from a knowledge base, and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

The Granite RAG 3.0 8b LoRA adapter adds hallucination detection and citation generation capabilities.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we compute embeddings of the passages using an embeddings model served by Ollama, and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
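
The three steps above can be sketched without any model or database. This is a toy illustration only: the word-set "embedding", the Jaccard overlap score, the sample passages, and the names `tokenize`, `retrieve`, and `build_prompt` are all invented here and are not part of this recipe, which uses real embeddings and a vector database.

```python
import re


def tokenize(text: str) -> set[str]:
    """Toy stand-in for an embedding: the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap, standing in for vector similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Step 1 (initial setup): index the knowledge-base passages.
# These passages are made up for the example.
passages = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within thirty days of purchase.",
    "The device charges fully in about three hours.",
]
index = [(tokenize(p), p) for p in passages]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: return the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda entry: similarity(q, entry[0]), reverse=True)
    return [passage for _, passage in ranked[:k]]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Step 3 (first half): combine retrieved passages and the query into a prompt."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Answer using only these passages:\n{context}\n\nQuestion: {query}"


query = "How long is the warranty?"
top_passages = retrieve(query)
prompt = build_prompt(query, top_passages)
print(prompt)  # a real pipeline would now send this prompt to the language model
```

The rest of this notebook replaces each toy piece with a real component: an embeddings model, a vector database, and the Granite RAG model.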

## Setting up the environment

Ensure you are running Python 3.10 or 3.11 in a freshly created virtual environment.

In [None]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."

### Install dependencies

The LoRA adapter for Granite RAG 3.0 8b is available from [Hugging Face](https://huggingface.co/ibm-granite/granite-rag-3.0-8b-lora). In this recipe, we need to download it and convert the format to GGUF for use with Ollama.

Granite utils includes some helpful functions. We also use the llama.cpp project for GGUF file conversion.

In [None]:
! git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    "huggingface_hub" \
    "langchain_community" \
    "langchain_ollama" \
    langchain-milvus \
    docling \
    -r llama.cpp/requirements.txt

## Selecting System Components

### Setup Granite with the Granite RAG LoRA adapter

The Granite RAG 3.0 8b LoRA adapter is built for the Granite 3.0 8b Instruct model.

We will use Ollama to configure Granite 3.0 8b Instruct as the base model and Granite RAG 3.0 8b as the LoRA adapter.

First make sure you have Ollama installed and `ollama serve` is running. Then we will download the Granite 3.0 8b Instruct model to use as the base model. The `granite3-dense:8b-instruct-fp16` model is in `fp16` format and is about 16 GB in size. The `granite3-dense:8b` model is quantized to a 4-bit format and is about 5 GB in size. Choose between them based on how much memory is available on your system.
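
As a rough sanity check on those sizes, the weights-only footprint can be estimated from the parameter count and the bits stored per parameter. This is a back-of-the-envelope sketch: the figure of 8 billion parameters is an assumption read off the "8b" in the model name, and the helper `weights_size_gb` is invented for this example.

```python
def weights_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: "8b" taken as roughly 8 billion parameters

fp16_gb = weights_size_gb(N_PARAMS, 16)  # 16 bits per parameter
q4_gb = weights_size_gb(N_PARAMS, 4)     # 4 bits per parameter

print(f"fp16 estimate:  {fp16_gb:.1f} GB")
print(f"4-bit estimate: {q4_gb:.1f} GB")
```

The fp16 estimate lines up with the quoted 16 GB. The quoted figure of about 5 GB for the quantized model is somewhat larger than the weights-only estimate; the difference presumably comes from details this estimate ignores, such as the exact parameter count, the quantization scheme, and file metadata.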

In [None]:
# Use fp16 or quantized model
fp16 = False
if fp16:
    granite3_model = 'granite3-dense:8b-instruct-fp16'
    granite3_rag_model = 'granite3-rag:8b-fp16'
else:
    granite3_model = 'granite3-dense:8b'
    granite3_rag_model = 'granite3-rag:8b'
! ollama pull {granite3_model}

We also need to convert the LoRA adapter from safetensors to the GGUF format in fp16. To do this, we download the safetensors of the LoRA adapter and the JSON configuration files of the base model using the `huggingface_hub` API.

In [None]:
import huggingface_hub

lora_folder = huggingface_hub.snapshot_download(repo_id="ibm-granite/granite-rag-3.0-8b-lora")
base_folder = huggingface_hub.snapshot_download(repo_id="ibm-granite/granite-3.0-8b-instruct", allow_patterns="*.json")


The download calls return the local folders into which the files were downloaded. These folder paths are needed for the conversion command. We will use the `convert_lora_to_gguf.py` script from the `llama.cpp` project to convert the LoRA adapter.
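
Before running the conversion, it can help to confirm that a downloaded folder contains the files you expect. The sketch below is a generic check, demonstrated on a temporary folder. The file names `adapter_config.json` and `adapter_model.safetensors` are an assumption based on the usual layout of LoRA adapter repositories, and `missing_files` is a helper invented for this example; inspect the actual folders to see what was downloaded.

```python
import tempfile
from pathlib import Path


def missing_files(folder: str, expected: list[str]) -> list[str]:
    """Return the expected file names that are not present in the folder."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Assumed file names; adjust to whatever the repositories actually contain.
EXPECTED_LORA_FILES = ["adapter_config.json", "adapter_model.safetensors"]

# Demonstration on a temporary folder that contains only one of the two files.
with tempfile.TemporaryDirectory() as demo_folder:
    (Path(demo_folder) / "adapter_config.json").write_text("{}")
    demo_missing = missing_files(demo_folder, EXPECTED_LORA_FILES)

print(demo_missing)
```

In the notebook, the same helper could be called as `missing_files(lora_folder, EXPECTED_LORA_FILES)` once the downloads have finished.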


In [None]:
lora_gguf = "granite-rag-3.0-8b-lora-fp16.gguf"
!python3 llama.cpp/convert_lora_to_gguf.py --outtype f16 --outfile {lora_gguf} --base {base_folder} -- {lora_folder}


Create a `Modelfile` for Ollama to use the base model and the LoRA adapter together. This will use the base model we previously pulled along with the GGUF version of the LoRA adapter we just created.


In [None]:
with open("Modelfile", "w") as modelfile:
    modelfile.write(f"""\
FROM {granite3_model}
ADAPTER {lora_gguf}
""")


Finally, we create a model in Ollama.

In [None]:
! ollama create {granite3_rag_model} -f Modelfile


Now we can use Granite RAG 3.0 8b!

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text.

You will need to download the embeddings model. First make sure you have Ollama installed and `ollama serve` is running.

In [None]:
embeddings = "granite-embedding:30m"
! ollama pull {embeddings}


To use a model from another provider, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [None]:
from langchain_ollama.embeddings import OllamaEmbeddings

embeddings_model = OllamaEmbeddings(model=embeddings)

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [None]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

## Use the Granite RAG 3.0 8b model

Create a model object for the Granite RAG 3.0 8b model in Ollama.

In [None]:
from langchain_ollama.llms import OllamaLLM

model = OllamaLLM(model=granite3_rag_model)

## Building the Vector Database

In this example, we start from a set of source documents and use [Docling](https://github.com/DS4SD/docling) to convert them into text and split the text into chunks. We then derive embedding vectors for the chunks using the embeddings model and load them into the vector database for querying.

### Use Docling to download the documents, convert to text, and split into chunks


Here we use a set of web pages about IBM and the US Open. For each source web page, we convert the web page into a DoclingDocument and then chunk the DoclingDocument. Finally, LangChain Documents are created for all the chunks that contain items labeled text or paragraph. The Documents are annotated with metadata that defines a unique document id and the source of the document.
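
The filtering and numbering logic can be sketched in isolation, without Docling or LangChain. Everything in this sketch is a stand-in: `StubChunk`, the label strings, the example URLs, and `to_records` are hypothetical and only mirror the idea of keeping text or paragraph chunks and giving each kept chunk a sequential `doc_id`.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk: its text plus the labels of its items."""
    text: str
    labels: list[str] = field(default_factory=list)


KEPT_LABELS = {"text", "paragraph"}  # stand-ins for the labels we keep


def to_records(chunks_by_source: dict[str, list[StubChunk]]) -> list[dict]:
    """Keep chunks with at least one kept label and number them from 1."""
    next_id = count(1)
    records = []
    for source, chunks in chunks_by_source.items():
        for chunk in chunks:
            if any(label in KEPT_LABELS for label in chunk.labels):
                records.append({
                    "page_content": chunk.text,
                    "metadata": {"doc_id": next(next_id), "source": source},
                })
    return records


records = to_records({
    "https://example.com/a": [
        StubChunk("A heading", ["section_header"]),  # dropped: no kept label
        StubChunk("Body text.", ["text"]),
    ],
    "https://example.com/b": [StubChunk("Another paragraph.", ["paragraph"])],
})
for record in records:
    print(record)
```

Because the counter advances only when a chunk is kept, the ids stay consecutive across all sources.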

In [None]:
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document

sources = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]

converter = DocumentConverter()
i = 0
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (i:=i+1), "source": source})
    for source in sources
    for chunk in HierarchicalChunker().chunk(converter.convert(source=source).document)
    if any(item.label in (DocItemLabel.TEXT, DocItemLabel.PARAGRAPH) for item in chunk.meta.doc_items)
]

print(f"{len(texts)} documents created")

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")

## Querying the Vector Database

We define the query to use for the RAG operation.

In [None]:
query = "How did IBM use watsonx at the 2024 US Open Tennis Championship?"

### Conduct a similarity search

To demonstrate the similarity search used during the RAG operation, search the database for the documents whose embedding vectors are closest to the embedding vector of the query.
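
As a rough illustration of what proximity in vector space means, the sketch below ranks a few hand-written vectors by cosine similarity to a query vector. Cosine similarity is only one common proximity measure; the metric the vector database actually uses depends on its index configuration. The three-dimensional vectors and the `top_k` helper are invented for this example, and real embedding vectors have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[int, list[float]], k: int = 2) -> list[int]:
    """Return the ids of the k stored vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors keyed by doc_id.
stored = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 0.0, 1.0],
}
nearest = top_k([1.0, 0.05, 0.0], stored, k=2)
print(nearest)
```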

In [None]:
docs = vector_db.similarity_search(query)
print(f"{len(docs)} documents returned")
for d in docs:
    print(f"doc_id={d.metadata['doc_id']}: {d.page_content}")

## Answering Questions

### Create the prompt for Granite RAG 3.0 8b

For Granite RAG 3.0 8b, we construct the prompt in a specific JSON format which includes the retrieved documents and metadata about the information to be included in the response. The values for `input` (the question), `hallucination_tags`, and `citations` are supplied when the chain is invoked.
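
To make the target shape concrete, the sketch below assembles the `documents` and `meta` parts of such a payload from two made-up passages and checks that the result parses as JSON. The `instruction` field is omitted for brevity. Note one difference from the templates in this recipe: the sketch serializes each document with `json.dumps`, which escapes quotes inside the text, whereas the document template inserts `page_content` verbatim between quotes. That escaping is an addition made for this example.

```python
import json

# Made-up retrieved passages; in the recipe these come from the vector database.
retrieved = [
    {"doc_id": 1, "text": 'A passage with "quoted" words.'},
    {"doc_id": 2, "text": "A second passage."},
]

# One JSON object per document, joined with commas (the document separator).
documents_fragment = ",".join(json.dumps(doc) for doc in retrieved)

# The flags are written as JSON booleans, matching the "true"/"false" strings
# passed in when the chain is invoked.
payload_text = (
    '{"documents": [' + documents_fragment + '], '
    '"meta": {"hallucination_tags": true, "citations": false}}'
)

payload = json.loads(payload_text)  # raises an error if the text is not valid JSON
print(json.dumps(payload, indent=2))
```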

In [None]:
from langchain.prompts import PromptTemplate

# Create a prompt template for question-answering with the retrieved context.
prompt_template = PromptTemplate.from_template(template="""<|start_of_role|>system<|end_of_role|>\
{{
  "instruction": "Respond to the user's latest question based solely on the information provided in the documents.
Ensure that your response is strictly aligned with the facts in the provided documents.
If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.
Make sure that your response follows the attributes mentioned in the 'meta' field.",
  "documents": [{context}],
  "meta": {{
    "hallucination_tags": {hallucination_tags},
    "citations": {citations}
  }}
}}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{input}""")

# Create a document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
{{"doc_id": {doc_id}, "text": "{page_content}"}}""")

### Automate the RAG pipeline


We now build a RAG chain from the model, the document retriever, and the prompts.

In [None]:
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Assemble the retrieval-augmented generation chain.
combine_docs_chain = create_stuff_documents_chain(model, prompt_template, document_prompt=document_prompt_template, document_separator=",")
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)

### Generate a retrieval-augmented response to a question

Use the RAG chain to process a question. The document chunks relevant to that question are retrieved and used as context. The response from Granite RAG 3.0 8b is a JSON document. This cell then parses the JSON document to retrieve the sentences of the response, along with metadata about each sentence that can be used to guide the displayed output.
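
The parsing logic can be tried without a running model. The sketch below uses a hand-written sample response whose shape (a list of objects with `sentence` and `meta.hallucination_level`) is inferred from the fields the cell below reads; the sample sentences and the `flag_sentences` helper are invented for this example and are not model output.

```python
import json

# Hand-written sample in the shape the parsing cell expects; not real model output.
sample_answer = json.dumps([
    {"sentence": "First sentence.", "meta": {"hallucination_level": "low"}},
    {"sentence": "Second sentence.", "meta": {"hallucination_level": "high"}},
    {"sentence": "Third sentence."},  # no meta: treated as "low" by default
])


def flag_sentences(answer_json: str) -> list[tuple[str, bool]]:
    """Pair each sentence with True if it should carry a warning footnote."""
    flagged = []
    for item in json.loads(answer_json):
        level = item.get("meta", {}).get("hallucination_level", "low")
        needs_footnote = level not in ("low", "unanswerable")
        flagged.append((item.get("sentence"), needs_footnote))
    return flagged


for sentence, needs_footnote in flag_sentences(sample_answer):
    print(sentence + ("¹" if needs_footnote else ""))
```

As in the cell below, any level other than `low` or `unanswerable` is treated as needing a footnote.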

In [None]:
import json
from langchain_core.utils.json import parse_json_markdown

output = rag_chain.invoke({"input": query, "hallucination_tags": "true", "citations": "false"})

print(f"Question:\n{output['input']}")
print("\nAnswer:")
try:
    responses = parse_json_markdown(output['answer'])
    need_footnote = False
    for response in responses:
        sentence = response.get("sentence")
        meta = response.get("meta", {})
        hallucination_level = meta.get("hallucination_level", "low")
        match hallucination_level:
            case "low" | "unanswerable":
                print(sentence)
            case "high" | _:
                need_footnote = True
                print(sentence, "¹", sep="")
    if need_footnote:
        print("\n¹ Warning: the sentence was not generated using the retrieved documents.")
    print("\nOriginal response in JSON format:")
    print(json.dumps(responses, indent=2))
except json.JSONDecodeError:
    print("\nOriginal response which was unable to be parsed as JSON:")
    print(output['answer'])