# Retrieval Augmented Generation (RAG) with Granite RAG 3.0 8b using Hugging Face Transformers and PEFT Libraries

*Using IBM Granite Models*

## In this notebook

This notebook contains instructions for performing Retrieval Augmented Generation (RAG) with the [Granite RAG 3.0 8b LoRA adapter](https://huggingface.co/ibm-granite/granite-rag-3.0-8b-lora) using the Hugging Face Transformers and PEFT libraries.

RAG is an architectural pattern that can be used to augment the performance of language models by recalling factual information from a knowledge base, and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

The Granite RAG 3.0 8b LoRA adapter adds hallucination detection and citation generation capabilities to the Granite 3.0 8b Instruct base model.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we compute embeddings of the passages with a Granite embedding model and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
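
The three steps above can be sketched with a toy example. This is only an illustration, not the method used in this recipe: word-overlap (Jaccard) scoring stands in for embedding similarity, an in-memory list stands in for the vector database, and `build_prompt` is a hypothetical helper that only assembles the text a language model would receive. The sample passages are invented.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def index_passages(passages: list[str]) -> list[tuple[str, set[str]]]:
    """Step 1 (setup): precompute a searchable representation of each passage."""
    return [(passage, tokenize(passage)) for passage in passages]

def retrieve(index: list[tuple[str, set[str]]], query: str, k: int = 2) -> list[str]:
    """Step 2 (per query): rank passages by Jaccard overlap with the query."""
    query_words = tokenize(query)

    def score(words: set[str]) -> float:
        union = query_words | words
        return len(query_words & words) / len(union) if union else 0.0

    ranked = sorted(index, key=lambda item: score(item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def build_prompt(query: str, passages: list[str]) -> str:
    """Step 3 (per query): combine the retrieved passages with the user query."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these passages:\n{context}\n\nQuestion: {query}"

# Invented sample data for illustration only
passages = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days.",
    "Returns are accepted within thirty days of purchase.",
]
question = "How long is the battery warranty?"
index = index_passages(passages)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The rest of this notebook follows the same shape, replacing each stand-in with a real component: an embedding model, a vector database, and a Granite model.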

## Setting up the environment

Ensure you are running Python 3.10 or 3.11 in a freshly created virtual environment.

In [None]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."

### Install dependencies

Granite utils includes some helpful functions.

In [None]:
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    peft \
    "langchain_community" \
    langchain-huggingface \
    langchain-milvus \
    docling

## Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text.

To use a model from another provider, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [None]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

## Use the Granite RAG 3.0 8b model

Create a model object for Granite RAG 3.0 8b on your workstation by loading the base model and applying the LoRA adapter. This requires a substantial amount of memory (more than 16 GB).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft.peft_model import PeftModel
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('ibm-granite/granite-3.0-8b-instruct', padding_side='left', trust_remote_code=True)

model_base = AutoModelForCausalLM.from_pretrained('ibm-granite/granite-3.0-8b-instruct')
model_lora = PeftModel.from_pretrained(model_base, 'ibm-granite/granite-rag-3.0-8b-lora')
model = model_lora.to(device)
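
The memory needed for the weights can be estimated with simple arithmetic on the parameter count. The sketch below assumes roughly 8 billion parameters (read off the model name, not an exact count) and covers the base model weights only; the LoRA adapter, activations, and the key-value cache used during generation add to the total. The helper name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Estimate the memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30

# Assumption: about 8 billion parameters, inferred from the "8b" in the model name
APPROX_PARAMS = 8e9

estimates = {
    "float32": weight_memory_gib(APPROX_PARAMS, 4),   # 4 bytes per parameter
    "bfloat16": weight_memory_gib(APPROX_PARAMS, 2),  # 2 bytes per parameter
}
for dtype_name, gib in estimates.items():
    print(f"{dtype_name}: ~{gib:.1f} GiB for weights alone")
```

If memory is tight, `AutoModelForCausalLM.from_pretrained` accepts a `torch_dtype` argument (for example `torch.bfloat16`) to load the weights in half precision, provided your hardware and Transformers version support it.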

## Building the Vector Database

In this example, we use [Docling](https://docling-project.github.io/docling/) to convert a set of source documents into text and split the text into chunks. We then derive embedding vectors for the chunks using the embedding model and load them into the vector database for querying.

### Use Docling to download the documents, convert to text, and split into chunks


Here we use a set of web pages about IBM and the US Open. We convert each source web page into a DoclingDocument and then chunk the DoclingDocument. Finally, we create a LangChain Document for every chunk that contains an item labeled text or paragraph. Each Document is annotated with metadata that records a unique document id and the source of the document.

In [None]:
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document

sources = [
    "https://www.ibm.com/case-studies/us-open",
    "https://www.ibm.com/sports/usopen",
    "https://newsroom.ibm.com/US-Open-AI-Tennis-Fan-Engagement",
    "https://newsroom.ibm.com/2024-08-15-ibm-and-the-usta-serve-up-new-and-enhanced-generative-ai-features-for-2024-us-open-digital-platforms",
]

converter = DocumentConverter()
chunker = HierarchicalChunker()
i = 0  # running counter used to assign a unique doc_id to each kept chunk
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (i:=i+1), "source": source})
    for source in sources
    for chunk in chunker.chunk(converter.convert(source=source).document)
    # Keep only chunks that contain at least one text or paragraph item
    if any(item.label in (DocItemLabel.TEXT, DocItemLabel.PARAGRAPH) for item in chunk.meta.doc_items)
]

print(f"{len(texts)} documents created")

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")

## Querying the Vector Database

We define the query to use for the RAG operation.

In [None]:
query = "How did IBM use watsonx at the 2024 US Open Tennis Championship?"

### Conduct a similarity search

To demonstrate the similarity search used during the RAG operation, search the database for the documents whose embedding vectors lie closest to the embedded query in vector space.

In [None]:
docs = vector_db.similarity_search(query)
print(f"{len(docs)} documents returned")
for d in docs:
    print(f"doc_id={d.metadata['doc_id']}: {d.page_content}")
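
To make "closest in vector space" concrete, the sketch below ranks a few hand-written 3-dimensional vectors by cosine similarity to a query vector. The vectors and labels are made up for illustration. Real embeddings have hundreds of dimensions, and the distance metric Milvus actually applies depends on how the index is configured, so treat cosine similarity here as one common choice and not as a description of this notebook's database settings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Return the k (label, score) pairs with the highest similarity."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy "embeddings" for illustration only
doc_vecs = {
    "tennis": [0.9, 0.1, 0.0],
    "cloud": [0.1, 0.9, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```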

## Answering Questions

### Create the prompt for Granite RAG 3.0 8b

For Granite RAG 3.0 8b, we construct the prompt in a specific JSON format which includes the retrieved documents and metadata about the information to be included in the response.

In [None]:
import json

instruction = {
  "instruction": """Respond to the user's latest question based solely on the information provided in the documents.
Ensure that your response is strictly aligned with the facts in the provided documents.
If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.
Make sure that your response follows the attributes mentioned in the 'meta' field.""",
  "documents": [
    {
      "doc_id": d.metadata['doc_id'],
      "text": d.page_content,
    }
    for d in docs
  ],
  "meta": {
    "hallucination_tags": True,
    "citations": False,
  },
}
conversation = [
    {"role": "system", "content": json.dumps(instruction)},
    {"role": "user", "content": query},
]
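
The same structure can be wrapped in a small function so that the `meta` flags are easy to toggle between runs. This is a sketch under stated assumptions: `build_conversation` is a hypothetical helper, it takes plain `(doc_id, text)` pairs instead of LangChain Documents so that it runs on its own, and the sample passages and shortened instruction text are placeholders, not the prompt used above.

```python
import json

def build_conversation(
    query: str,
    passages: list[tuple[int, str]],
    instruction_text: str,
    hallucination_tags: bool = True,
    citations: bool = False,
) -> list[dict[str, str]]:
    """Assemble the system and user messages in the JSON layout shown above."""
    system_payload = {
        "instruction": instruction_text,
        "documents": [{"doc_id": doc_id, "text": text} for doc_id, text in passages],
        "meta": {"hallucination_tags": hallucination_tags, "citations": citations},
    }
    return [
        {"role": "system", "content": json.dumps(system_payload)},
        {"role": "user", "content": query},
    ]

# Placeholder inputs for illustration only
sample_passages = [(1, "Placeholder passage one."), (2, "Placeholder passage two.")]
conversation_sketch = build_conversation(
    "What do the passages say?",
    sample_passages,
    instruction_text="Respond using only the provided documents.",
    citations=True,
)

# The system message is a JSON string, so it can be decoded again for inspection
decoded = json.loads(conversation_sketch[0]["content"])
print(json.dumps(decoded["meta"]))
```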

### Generate a retrieval-augmented response to a question

The documents returned by the similarity search are used as context. Granite RAG 3.0 8b returns its response as a JSON document. This cell parses that JSON document to retrieve the sentences of the response, along with per-sentence metadata that can be used to guide the displayed output.

In [None]:
from langchain_core.utils.json import parse_json_markdown

prompt = tokenizer.apply_chat_template(conversation=conversation, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=1000)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(f"Question:\n{query}")
print("\nAnswer:")
try:
    responses = parse_json_markdown(answer)
    need_footnote = False
    for response in responses:
        sentence = response.get("sentence")
        meta = response.get("meta", {})
        hallucination_level = meta.get("hallucination_level", "low")
        match hallucination_level:
            case "low" | "unanswerable":
                print(sentence)
            case _:  # "high" or any unrecognized level
                need_footnote = True
                print(sentence, "¹", sep="")
    if need_footnote:
        print("\n¹ Warning: the sentence was not generated using the retrieved documents.")
    print("\nOriginal response in JSON format:")
    print(json.dumps(responses, indent=2))
except json.JSONDecodeError:
    print("\nOriginal response which was unable to be parsed as JSON:")
    print(answer)
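
The display rule in the cell above can be exercised without loading the model. The sketch below applies the same rule to a hand-written sample. The sample sentences are invented, and the response shape (a list of objects with `sentence` and `meta.hallucination_level`) is inferred from the fields that the cell above reads, not from a published schema. `render_response` is a hypothetical helper.

```python
import json

def render_response(raw_json: str, marker: str = "¹") -> list[str]:
    """Return display lines, flagging sentences not marked low or unanswerable."""
    lines = []
    for item in json.loads(raw_json):
        sentence = item.get("sentence", "")
        level = item.get("meta", {}).get("hallucination_level", "low")
        if level in ("low", "unanswerable"):
            lines.append(sentence)
        else:
            # Treat "high" and any unrecognized level as needing a warning marker
            lines.append(sentence + marker)
    return lines

# Invented sample response for illustration only
sample = json.dumps([
    {"sentence": "Sentence grounded in the documents.", "meta": {"hallucination_level": "low"}},
    {"sentence": "Sentence without support.", "meta": {"hallucination_level": "high"}},
    {"sentence": "Sentence with no metadata."},
])

rendered = render_response(sample)
print("\n".join(rendered))
```

Returning the lines instead of printing them directly keeps the rule easy to test and leaves the choice of output format to the caller.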