# Part 3: Building a Private RAG Pipeline with a Vector Database

Welcome to the final part of our series! So far, we have a private, secure, and optimized LLM running locally. But it's a generalist. It knows about Music Television, but not our internal tools like the **Migration Toolkit for Virtualization (MTV)**.

In this notebook, we will solve this by building a Retrieval-Augmented Generation (RAG) pipeline. This will give our LLM a 'memory' of our own private data, without ever needing to be retrained.

<img src="documentation.jpeg" width="400">

## What is Retrieval-Augmented Generation (RAG)?

RAG is a technique for providing an LLM with information from an external knowledge base. Instead of the LLM relying solely on its training data, we first 'retrieve' relevant information and then pass that information to the LLM as context along with the user's query.

The process looks like this:
1.  **User Query:** The user asks a question (e.g., "What is MTV?").
2.  **Search/Retrieve:** We search our private knowledge base (a collection of documents) for text chunks that are relevant to the query.
3.  **Augment Prompt:** We create a new prompt for the LLM, stuffing it with the relevant text we found. It looks something like: `"Given this context: [relevant text about Migration Toolkit for Virtualization], answer this question: What is MTV?"`
4.  **Generate:** The LLM, now equipped with the correct context, generates an accurate answer.
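
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the pipeline built later in this notebook: the `KNOWLEDGE_BASE` list, the word-overlap scoring, and the helper names `tokenize`, `retrieve`, and `augment` are all hypothetical stand-ins, and a real system would replace the word-overlap ranking with vector search.

```python
import re

# Hypothetical mini knowledge base; in this notebook the real one is built from the MTV docs.
KNOWLEDGE_BASE = [
    "The Migration Toolkit for Virtualization (MTV) migrates virtual machines to OpenShift Virtualization.",
    "FAISS provides efficient similarity search over vectors.",
    "Chunk overlap keeps sentences that straddle a boundary in both chunks.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Step 2: rank documents by words shared with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def augment(query, context_chunks):
    """Step 3: place the retrieved text in the prompt ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Given this context: {context}\nAnswer this question: {query}"

query = "What is MTV?"                                    # Step 1: user query
prompt = augment(query, retrieve(query, KNOWLEDGE_BASE))  # Steps 2 and 3
print(prompt)                                             # Step 4 would send this prompt to the LLM
```

Word overlap only matches exact terms, which is why the rest of this notebook uses embeddings instead.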

To do the 'Retrieve' step efficiently, we need a **Vector Database**.

## Setting Up the RAG Pipeline

We'll use a few key libraries:
- **`requests` & `BeautifulSoup4`:** To scrape the documentation for the Migration Toolkit for Virtualization from the web.
- **`langchain` & `langchain-community`:** A popular framework that simplifies building applications with LLMs. We'll use it to manage our data, prompts, and the RAG chain itself.
- **`sentence-transformers`:** To generate embeddings (numerical representations) of our text.
- **`faiss-cpu`:** A library for efficient similarity search. This will be our local vector database.

In [None]:
!pip install langchain langchain-community langchain-huggingface sentence-transformers faiss-cpu beautifulsoup4 requests "optimum[openvino]" transformers torch accelerate

### Step 1: Load Our Private Data

First, we need to get the text data for our knowledge base. We'll use two different loaders:
1. **HTML Loader**: Scrapes data directly from the Red Hat documentation page
2. **Markdown Loader**: Loads data from a local markdown file (`kubectl-mtv.md`)

This gives us a comprehensive knowledge base combining both web documentation and local files.

In [1]:
import requests
from bs4 import BeautifulSoup
import os

def load_data_from_html_url(url):
    """HTML Loader: Scrape data from a web URL"""
    print(f"Scraping data from {url}...")
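    response = requests.get(url, timeout=30)  # Don't hang forever on a slow server
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page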
    soup = BeautifulSoup(response.content, 'html.parser')
    # This selector is specific to the Red Hat documentation page structure
    # It might need adjustment for other pages
    article_body = soup.find('div', class_='docs-content-container')
    if article_body:
        print("HTML data scraped successfully.")
        return article_body.get_text(separator='\n', strip=True)
    else:
        raise ValueError("Could not find the main content of the page.")

def load_data_from_markdown_file(file_path):
    """Markdown Loader: Load data from a local markdown file"""
    print(f"Loading markdown data from {file_path}...")
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Markdown file not found: {file_path}")
    
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    
    print("Markdown data loaded successfully.")
    return content

# Load data from HTML URL
url = "https://docs.redhat.com/en/documentation/migration_toolkit_for_virtualization/2.0/html/installing_and_using_the_migration_toolkit_for_virtualization/installing-mtv_mtv"
html_docs_text = load_data_from_html_url(url)

# Load data from local markdown file
markdown_file_path = "kubectl-mtv.md"
markdown_docs_text = load_data_from_markdown_file(markdown_file_path)

# Combine both data sources
mtv_docs_text = html_docs_text + "\n\n" + markdown_docs_text
print(f"Combined documentation loaded: {len(html_docs_text)} chars from HTML + {len(markdown_docs_text)} chars from Markdown")

Scraping data from https://docs.redhat.com/en/documentation/migration_toolkit_for_virtualization/2.0/html/installing_and_using_the_migration_toolkit_for_virtualization/installing-mtv_mtv...
HTML data scraped successfully.
Loading markdown data from kubectl-mtv.md...
Markdown data loaded successfully.
Combined documentation loaded: 8403 chars from HTML + 13070 chars from Markdown


### Step 2: Chunk the Data

LLMs have a limited context window, so we can't just feed them the entire documentation at once. We need to split the text into smaller, meaningful chunks. In the cell below, `chunk_size` and `chunk_overlap` are measured in characters, not tokens.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=100
)
chunks = text_splitter.split_text(mtv_docs_text)
print(f"Split data into {len(chunks)} chunks.")

Split data into 25 chunks.
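
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
        start += step
    return windows

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window repeats the last three characters of the previous one. That repetition is what lets a sentence that straddles a boundary appear intact in at least one chunk.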


### Step 3: Create Embeddings and Store in Vector DB

Now for the magic. We'll convert each text chunk into a vector (a list of numbers) using an **embedding model**. These vectors capture the semantic meaning of the text. We'll use a small, efficient embedding model from IBM's Granite family. Then we'll store these vectors in our FAISS vector database.

When a user asks a question, we'll convert their question into a vector too and use FAISS to find the text chunks with the most similar vectors.

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model_name = "ibm-granite/granite-embedding-30m-english"
print(f"Loading embedding model: {embedding_model_name}")
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

print("Creating vector database...")
vector_store = FAISS.from_texts(chunks, embedding=embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # Retrieve top 3 most relevant chunks
print("Vector database is ready.")

Loading embedding model: ibm-granite/granite-embedding-30m-english
Creating vector database...
Vector database is ready.
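
To build some intuition for "most similar vectors", the sketch below ranks a few hand-made three-dimensional vectors against a query vector using cosine similarity. Everything here is an assumption for illustration: the vectors are invented rather than produced by the Granite embedding model, real embeddings have hundreds of dimensions, and cosine similarity is only one common measure (FAISS supports several metrics, and the store created above may use a different one, such as Euclidean distance).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text-to-vector mapping standing in for an embedding model's output.
toy_index = {
    "MTV migrates virtual machines": [0.9, 0.1, 0.0],
    "MTV is a music channel": [0.1, 0.9, 0.1],
    "FAISS stores vectors": [0.0, 0.2, 0.9],
}

def top_k(query_vec, index, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

query_vector = [0.8, 0.2, 0.1]  # Invented embedding of a question about VM migration
best = top_k(query_vector, toy_index, k=2)
print(best)
```

This brute-force scan compares the query against every stored vector. A library like FAISS exists to make that search efficient on much larger collections.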


### Step 4: Build the RAG Chain

We have all the pieces. Now we'll use LangChain to assemble them into a single, runnable chain. This chain will automatically handle retrieving the context, formatting the prompt, and calling our local, optimized LLM.

In [4]:
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import time
import os

# Set tokenizer parallelism to avoid warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Load our optimized LLM from Part 2
llm_model_name = "ibm-granite/granite-3.3-2b-instruct"
optimized_model_dir = "./granite-2b-openvino"
print("Loading optimized LLM...")
ov_model = OVModelForCausalLM.from_pretrained(optimized_model_dir)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)

# Create a Hugging Face pipeline with updated parameters
pipe = pipeline(
    "text-generation", 
    model=ov_model, 
    tokenizer=tokenizer, 
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
    return_full_text=False  # Only return the generated text, not the input
)
llm = HuggingFacePipeline(pipeline=pipe)

# Define the prompt template with better formatting
template = """You are a helpful assistant with expertise in virtualization and migration technologies. 
Use the following pieces of context to answer the question at the end. 
If you don't know the answer based on the provided context, just say that you don't know, don't try to make up an answer.

Context: {context}

Question: {question}

Helpful Answer:"""
prompt_template = PromptTemplate.from_template(template)

# Function to format documents for better context
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Assemble the RAG chain with improved document formatting
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

print("RAG chain is ready.")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading optimized LLM...


Device set to use cpu


RAG chain is ready.
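
It can help to see what the piped chain does as ordinary function calls. The sketch below mirrors the same data flow (retrieve, format, fill the template, call the model) using stand-ins: `Doc`, `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical names, the two document strings are invented, and the template is shortened. It approximates the flow only and does not show how LangChain implements it.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is used here."""
    page_content: str

def fake_retriever(question):
    """Pretend vector search: always returns the same two invented chunks."""
    return [
        Doc("MTV is the Migration Toolkit for Virtualization."),
        Doc("MTV is installed through an Operator."),
    ]

def fake_llm(prompt):
    """Pretend model: reports the prompt size instead of generating text."""
    return f"[model output for a prompt of {len(prompt)} characters]"

def format_docs(docs):
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)

TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nHelpful Answer:"

def run_chain(question):
    # The dict step: build "context" from retrieval and pass the question through unchanged.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    prompt = TEMPLATE.format(**inputs)  # The prompt-template step
    return prompt, fake_llm(prompt)     # The LLM step

prompt, answer = run_chain("What is MTV?")
print(prompt)
print(answer)
```

Printing the filled-in prompt like this is a handy way to check what context the model actually receives.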


### Step 5: Ask the Right Question!

Now, let's ask our question again. This time, the RAG chain will fetch the relevant context from our vector database and provide it to the LLM.

In [5]:
question = "What is MTV?"

print(f"Querying RAG chain with: '{question}'")
start_time = time.time()
response = rag_chain.invoke(question)
end_time = time.time()
duration = end_time - start_time

print(f"\nResponse generated in {duration:.2f} seconds.")
print("\n---\n")
print(response)

Querying RAG chain with: 'What is MTV?'

Response generated in 24.28 seconds.

---

 MTV stands for Migration Toolkit for Virtualization. It is an open-source project that enables the migration of virtual machines (VMs) between different virtualization platforms, such as VMware, Red Hat Virtualization, and OpenShift Virtualization. MTV provides a set of tools and operators to automate and streamline the migration process, making it easier for administrators to move workloads from legacy environments to modern cloud platforms.

The MTV Operator is a Kubernetes operator that simplifies the deployment and management of MTV components within an OpenShift cluster. It allows users to install, configure, and manage the Forklift/MTV components, including the ForkliftController, which is responsible for managing the migration process.

The provided context describes the steps to install and configure MTV, including the installation of the OpenShift Container Platform web console, OpenShift Virt

## Conclusion: From Generic to Genius

Look at the difference! The first time we asked, we got an answer about a music channel. Now, with RAG, our model correctly identifies MTV as the **Migration Toolkit for Virtualization** and provides a detailed, accurate description based on the documentation we provided.

We have successfully built a complete, end-to-end private AI pipeline:
1.  **Secure:** All data and models remain on our local machine.
2.  **Fast:** Optimized with Intel OpenVINO for CPU inference (the answer above took about 24 seconds).
3.  **Smart:** Augmented with our own private data using RAG.

This architecture is the blueprint for building powerful, secure, and context-aware AI applications for any organization that handles sensitive data.