<a href="https://colab.research.google.com/github/runf21/NIM_RAG/blob/main/Enterprise_RAG_Blueprint_EXT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enterprise RAG Blueprint


<img src =https://developer-blogs.nvidia.com/zh-cn-blog/wp-content/uploads/sites/2/2023/11/GenAI-Promo-AWS-DevNews-PRESS-1920x1080-1-960x540.png>


# Introduction

The advent of generative AI has sparked a revolution across industries, opening doors to innovations that were once the realm of science fiction.

At the forefront of this movement is a class of deep learning algorithms called Large Language Models (LLMs). These LLMs can recognize, summarize, translate, predict, and generate content after being trained on massive datasets.

Despite their extraordinary potential, implementing AI applications at scale presents significant challenges, demanding cutting-edge infrastructure and specialized expertise.

NVIDIA addresses these challenges head-on. As a pioneer in AI computing, NVIDIA has created a comprehensive ecosystem to efficiently build, train, and deploy LLMs. Today, we'll explore this ecosystem while constructing a Retrieval-Augmented Generation (RAG) system - an advanced LLM-powered enterprise chatbot.

## NVIDIA NGC

At the core of the NVIDIA software platform is the NVIDIA GPU Cloud (NGC) - a cloud-based platform that provides access to GPU-optimized AI software, including pre-trained models, frameworks, and SDKs. It serves as a hub for AI developers to discover, download, and deploy GPU-accelerated software for their AI workflows.

<img src="https://i.ibb.co/TB4dL4BL/NGC-website.png" width="100%">

NGC streamlines the development process by offering containerized applications and enterprise-grade support, allowing teams to quickly implement AI solutions without building everything from scratch.

## Foundational Models and NIMs
NGC hosts NVIDIA's Foundational Models - a collection of curated, pre-trained models that serve as building blocks for AI applications. These models are optimized by NVIDIA for performance and deployed as NVIDIA Inference Microservices (NIMs). NIMs package models into containerized, production-ready services with standardized APIs, dramatically simplifying deployment and scaling.

<img src="https://i.ibb.co/cd85XDM/NVIDIA-API-catalog.png" width="100%">

NVIDIA hosted deployments of NIMs are available to test on the [NVIDIA Build page](https://build.nvidia.com/). After testing, NIMs can be exported from NVIDIA’s API catalog using the NVIDIA AI Enterprise license and run on-premises or in the cloud, giving enterprises ownership and full control of their IP and AI application.

<img src="https://i.ibb.co/v6RPbrBw/NVIDIA-build-demo.png" width="100%">

# Prerequisites

## Using this notebook
Google Colab is a free, cloud-based platform that allows you to write and execute Python code in a Jupyter notebook environment directly from your browser. It provides access to powerful computing resources, including GPUs, making it ideal for machine learning, data analysis, and educational projects.

To save your changes in this notebook, you will need to make your own copy.

The notebook consists of cells where you can write and run Python code. To execute the code in a cell, simply click on the cell and press the "Play" button or use `Shift + Enter`.

## Installing Dependencies

Let's install all the required libraries and dependencies we need for our chatbot to work correctly. A core dependency is **LangChain** - a popular software framework that provides modular components for building LLM applications.

In [None]:
import os
import subprocess
import sys

# Pin NumPy to a specific version; the runtime restarts if a reinstall is needed
desired_version = "1.26.4"

try:
    import numpy as np
    current_version = np.__version__
    print(f"Current NumPy version: {current_version}")

    if current_version != desired_version:
        print(f"Installing NumPy version {desired_version}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", f"numpy=={desired_version}"])

        print("Restarting runtime to apply changes...")
        os.kill(os.getpid(), 9)
    else:
        print("NumPy is already the desired version.")

except ImportError:
    print("NumPy is not installed. Installing...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", f"numpy=={desired_version}"])
    os.kill(os.getpid(), 9)


In [None]:
%pip install --quiet \
    chainlit==1.3.2 \
    chromadb==0.5.20 \
    dataclasses-json==0.6.7 \
    fastapi==0.115.5 \
    kaleido==0.2.1 \
    langchain==0.3.0 \
    langchain-chroma==0.1.4 \
    langchain-community==0.3.0 \
    langchain-nvidia-ai-endpoints==0.3.5 \
    langchain-unstructured==0.1.6 \
    protobuf==4.25.2 \
    pydantic==2.9.2 \
    pymupdf==1.25.3 \
    "unstructured[all-docs]"==0.17.2

We'll also install some system dependencies to support data processing for our data loader. These packages enable document handling, text extraction, and optical character recognition (OCR).

In [None]:
!apt-get -qq install libmagic-dev poppler-utils tesseract-ocr libreoffice > /dev/null

Finally, we'll download and install Tunnelmole - an open-source application that provides a public URL for a web server running locally. This will allow us to expose a frontend web app for the chatbot we'll create and run in this notebook.

In [None]:
!curl -O https://install.tunnelmole.com/n3d5g/install && sudo bash install > /dev/null

## Apply for an NVIDIA NGC Account
To use the NVIDIA AI Foundational Models and NIMs, you'll need to apply for an account on NVIDIA NGC.

---

### Steps
1. Go to the [NVIDIA NGC page](https://ngc.nvidia.com/signin)
2. Click the **Welcome Guest** icon in the top right corner of the page and select **Sign In / Sign Up**.

<img src ="https://i.ibb.co/Kx8pz3Fw/welcome-guest-signin.png" width="50%">

3. Enter your email address and click **Continue**.

<img src ="https://docscontent.nvidia.com/dims4/default/45077b7/2147483647/strip/true/crop/462x585+0+0/resize/924x1170!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Flogin-new-org.png" width="40%">

4. In this step, you'll create your new account by creating a new password and confirming it. Make sure to review the NVIDIA Account Terms of Use and Privacy Policy, and click **Create Account** to accept and proceed with account creation.

<img src ="https://docscontent.nvidia.com/dims4/default/90e83cc/2147483647/strip/true/crop/668x633+0+0/resize/1336x1266!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Fcreate-an-account-dark.png" width="50%">

5. You should shortly receive an email to verify your registration. Inside the email, click **Verify Email Address**.

<img src ="https://docscontent.nvidia.com/dims4/default/5dac44d/2147483647/strip/true/crop/605x628+0+0/resize/1210x1256!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Faccount-created-email.png" width="50%">

6. You are automatically directed to nvidia.com, where a page confirms that your email was verified successfully. This window will close automatically.

<img src ="https://docscontent.nvidia.com/dims4/default/2f7c26a/2147483647/strip/true/crop/685x293+0+0/resize/1370x586!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Femail-verified-dark.png" width="50%">

7. At the Almost done! dialog, set your communications preferences, and then click **Submit**.

<img src ="https://docscontent.nvidia.com/dims4/default/d726cec/2147483647/strip/true/crop/664x399+0+0/resize/1328x798!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Falmost-done-dark-slim.png" width="50%">

8. Enter the password you just created to continue setting up your NVIDIA Cloud Account. (This is a required security measure).

<img src ="https://docscontent.nvidia.com/dims4/default/22fd238/2147483647/strip/true/crop/657x562+0+0/resize/1314x1124!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Fngc-sign-in-existing-user-dark.png" width="50%">

9. Give your NVIDIA Cloud Account (NCA) a name that will help you identify it easily the next time you sign-in.

<img src ="https://docscontent.nvidia.com/dims4/default/0088a68/2147483647/strip/true/crop/603x679+0+0/resize/1206x1358!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Fcreate-nvidia-cloud-account.png" width="50%">

10. Complete your user profile at the Set Your Profile screen, agree to the NVIDIA GPU Cloud Terms of Use, and then click **Submit**.

<img src ="https://docscontent.nvidia.com/dims4/default/5d44ae7/2147483647/strip/true/crop/806x577+0+0/resize/1612x1154!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Fset-your-profile.png" width="50%">

11. Your NVIDIA NGC account is created and you are automatically redirected to your individual NGC org.

<img src ="https://docscontent.nvidia.com/dims4/default/7349fc5/2147483647/strip/true/crop/1616x856+0+0/resize/2880x1526!/format/webp/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fdita%2F00000195-414b-d50d-a1bd-ed5fdd100000%2Fngc%2Fgpu-cloud%2Fcommon%2Fgraphics%2Fgraphics-ngc%2Fngc-default-landing-page.png" width="50%">




## Generating an NGC API key

### Steps
1. Navigate to the [NGC API Catalog page](https://build.nvidia.com/)
2. Select a model.
3. Select **Get API Key** and log in if prompted.
<img src="https://docs.nvidia.com/nim/large-language-models/latest/_images/build_get_api_key.png" width="100%">
4. Select **Generate Key**
<img src="https://docs.nvidia.com/nim/large-language-models/latest/_images/build_generate_key.png" width="100%">
5. Your API key appears in the following dialog. Copy and save this API key. **IMPORTANT**: Keep it somewhere safe because it is displayed only once! If you lose your key, you'll have to create a new one.
<img src="https://docs.nvidia.com/nim/large-language-models/latest/_images/build_copy_key.png" width="100%">

## Enter NGC API Key

Time to set our NGC API key! Run this cell, input your key in the box that shows up below, and press `ENTER` on your keyboard.

In [None]:
import getpass
import os

def set_ngc_api_key():
    """Prompt the user to enter an NVIDIA API key if it's not set or invalid."""
    while True:
        nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")

        if nvapi_key.startswith("nvapi-"):
            os.environ["NVIDIA_API_KEY"] = nvapi_key
            print("NVIDIA API Key has been successfully set!")
            break
        else:
            print("Invalid API key. Please try again.")

# Check if the key is already set and valid
current_key = os.environ.get("NVIDIA_API_KEY", "")

if not current_key.startswith("nvapi-"):
    print("NVIDIA API Key is missing or invalid. Please enter a new key.")
    set_ngc_api_key()
else:
    print("NVIDIA API Key is already set.")
    change_key = input("Would you like to enter a different key? (yes/no): ").strip().lower()

    if change_key in ["yes", "y"]:
        set_ngc_api_key()


## Runtimes

This notebook runs on a standard CPU runtime without requiring a local GPU. While GPUs are essential for efficient model inference, all inference in this notebook is handled through cloud-based NVIDIA Inference Microservices (NIMs) hosted on the NVIDIA NGC API catalog and accessible via our NGC API key.

Your runtime should already be set up, but you can verify your runtime type in the top right corner of the navigation bar. Initially, before running any code cells, the runtime will be in a disconnected state. You can either manually activate it by clicking the runtime button or let it automatically connect when you run your first code cell.

<img src="https://i.ibb.co/spG4Fttw/Connect-button.png" width="20%">

Sometimes it's necessary to change your runtime. You can do this by clicking the down arrow and selecting "Change runtime type".

<img src ="https://i.ibb.co/CM932jJ/change-runtime.png" width="60%">

The NVIDIA A100 and L4 GPUs are among the best available, but they require credits which cost money. They're also far more powerful than necessary for today's purposes since our inference is being hosted by NGC. We'll opt for the free CPU runtime offered by Google instead.

<img src="https://i.ibb.co/9fpd3ww/runtime-type.png" width="60%">

With the prerequisites complete, it's time to start building our application! But first - let's review the fundamental concepts of Retrieval-Augmented Generation: what it is, how it works, and the key components that power it.

# RAG 101

## A Very Very Short History of RAG
Retrieval-Augmented Generation, or RAG for short, is one of the most exciting advancements in generative AI. Patrick Lewis, lead author of the 2020 paper that coined the term, apologized for the unflattering acronym that now describes a growing family of methods across hundreds of papers and dozens of commercial services he believes represent the future of generative AI.
“We definitely would have put more thought into the name had we known our work would become so widespread,” Lewis said in an interview from Singapore, where he was sharing his ideas with a regional conference of database developers.
“We always planned to have a nicer sounding name, but when it came time to write the paper, no one had a better idea,” said Lewis, who now leads a RAG team at AI startup Cohere.

<img src ="https://i.ibb.co/pvmYj2G/patrick-lewis.jpg" width="50%">
<img src ="https://i.ibb.co/Tq81dt3/RAG-whitepaper.png" width="50%">

## RAG Revealed
Before we discuss the technical details of a RAG, let’s use a real-world analogy. Imagine a courtroom. Judges hear and decide cases based on their general understanding of the law. Sometimes a case — like a malpractice suit or a labor dispute — requires special expertise, so judges send court clerks to a law library, looking for precedents and specific cases they can cite.

Like a good judge, LLMs can respond to a wide variety of human queries. But to deliver authoritative answers that cite sources, the model needs an assistant to do some research. Think of a RAG as the court clerk. It’s essentially a second neural network model that acts as a helper for the original LLM, the judge in our example.

<img src =https://i.ibb.co/4p028MS/RAG-courtroom.png width="90%">

### Embedding Models and Vector Databases
This second model is called an embedding model. It converts the knowledge base it’s been given into a numerical format known as an **embedding vector**, which is stored in a **vector database**. Think of a vector as a set of coordinates like latitude and longitude on a map – a list of numbers that points to the location of a piece of information. When you use GPS on your phone to search for the nearest restaurant, an algorithm calculates the distance between your coordinates and a list of coordinates in its database. The restaurants it returns are the ones closest to your location vector. This is essentially how a vector database works in our RAG.

When we query the vector database, we employ a method called Semantic Search. This data searching technique uses the intent and contextual meaning behind a query to deliver more accurate results. Unlike keyword search, which focuses on matching specific words or synonyms, semantic search captures the overall meaning of the query. By considering the context and relationships between words, it delivers results that are more relevant to what the user is actually looking for.
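To make the idea of "closeness" concrete, here is a minimal sketch of semantic search using cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors and sample texts are hand-made for illustration only; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative values, not produced by a real model).
toy_embeddings = {
    "How do I submit a job to Slurm?": [0.9, 0.1, 0.0],
    "Submitting workloads with a scheduler": [0.8, 0.3, 0.1],
    "Best pizza toppings": [0.0, 0.2, 0.9],
}

query_text = "How do I submit a job to Slurm?"
query_vector = toy_embeddings[query_text]

# Score every other text against the query and pick the closest one.
scores = {
    text: cosine_similarity(query_vector, vector)
    for text, vector in toy_embeddings.items()
    if text != query_text
}
best_match = max(scores, key=scores.get)
print(best_match)
```

Note that the best match shares almost no keywords with the query; it wins because its vector points in nearly the same direction.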

<img src =https://i.ibb.co/kxNQbXY/Vector-database-slide.png>

### ReRanker Models
While semantic search helps us find the most relevant documents based on meaning rather than exact keywords, not all retrieved results are equally useful. Some may be highly relevant, while others might only be loosely related. This is where re-ranking comes into play.

Re-ranking acts as a second stage in the retrieval process, ensuring that the most important and contextually accurate documents are prioritized. By leveraging the advanced language understanding of large language models (LLMs), re-ranking refines search results, making our RAG system even more effective.
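The two-stage pattern can be sketched in a few lines. This is a toy illustration, not how a reranker model works internally: `coarse_score` and `fine_score` are hypothetical stand-ins (word overlap and Jaccard similarity) for the fast vector search and the slower, more accurate reranker.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns a set of words."""
    return set(text.lower().split())

def coarse_score(query, doc):
    """Stage 1 stand-in: count of shared words (cheap but approximate)."""
    return len(tokenize(query) & tokenize(doc))

def fine_score(query, doc):
    """Stage 2 stand-in: Jaccard overlap, which penalizes off-topic padding."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / len(q | d)

def retrieve_then_rerank(query, docs, k, top_n):
    # Stage 1: keep the k best candidates according to the cheap score.
    candidates = sorted(docs, key=lambda d: coarse_score(query, d), reverse=True)[:k]
    # Stage 2: reorder only those candidates with the finer score, keep top_n.
    return sorted(candidates, key=lambda d: fine_score(query, d), reverse=True)[:top_n]

docs = [
    "submit a job with slurm using sbatch and many other unrelated cluster notes",
    "submit a slurm job",
    "monitor gpu temperature",
    "a list of a few unrelated things",
]
query = "submit a slurm job"
results = retrieve_then_rerank(query, docs, k=3, top_n=2)
print(results)
```

The first stage cannot tell the first two documents apart (both share four words with the query); the second stage promotes the shorter, more focused one.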

<img src="https://i.ibb.co/zVXSHrP8/Re-Ranker-Diagram.png" width="100%">

### Connecting the Dots

Let’s bring everything together with a high-level overview of a RAG workflow.

**Phase 1: Document Ingestion**

In the first phase (top section of the diagram), data is ingested from a knowledge base, which may include documents, PDFs, APIs, or structured databases. This data is preprocessed by chunking it into smaller segments before being transformed into numerical embeddings by the embedding model NIM. These embeddings are then stored in a vector database, allowing for fast and efficient retrieval later.

**Phase 2: User Query & Retrieval**

When a user submits a query, the system first converts it into an embedding and searches the vector database using semantic search to find the most relevant information. The retriever pulls these matching results as an initial context for the response.

Some retrieved results may be more relevant than others, which is where the ReRanker NIM comes in — analyzing the retrieved chunks and reordering them to ensure the most precise and valuable information is prioritized.

The LLM NIM then receives this refined context, generating a response that is more accurate, contextually relevant, and informed by the knowledge base.

This approach ensures that the chatbot doesn’t just retrieve documents—it delivers the best, most relevant insights to enhance user interactions.

<img src="https://i.ibb.co/TBDXjwG7/RAG-Re-Ranker-Diagram.png" width="100%">

Now that we understand the concepts behind Retrieval-Augmented Generation (RAG), let’s build our RAGbot! Unlike a traditional LLM-powered chatbot, a RAG bot can retrieve relevant, up-to-date information from external data sources!  

**Remember you need to have run all of the cells in the prerequisites section for this to work.**


# Concept to Code: Building a RAG

## Constructing the Knowledge Base

The first step in building our RAG system is to create and index the knowledge base — the backbone of its architecture. As we explored in RAG 101, effective retrieval is essential for generating high-quality responses in our final application. Once the pipeline is established, the knowledge base can be updated dynamically, but initial data ingestion must occur before the LLM can retrieve and utilize the information.





### Downloading the data
Let's start with our data. For today’s RAG, we’ve sourced a PDF file containing documentation on NVIDIA Base Command Manager. NVIDIA Base Command Manager (BCM) is a cluster management software for high-performance computing (HPC) environments. It enables workload scheduling, resource allocation, and system monitoring across CPU and GPU clusters. BCM supports job submission through workload managers like Slurm, PBS, and Kubernetes, and integrates with tools like Jupyter and Spark for AI and data processing.

Let's download it below and check that the file is in our local directory.

In [None]:
!wget -O "BCM_User_Manual.pdf" -nc https://support.brightcomputing.com/manuals/10/user-manual.pdf

<img src="https://i.ibb.co/GfjJgYfx/BCM-user-file.png" width="50%">

Let's take a look at a few sample pages from the NVIDIA Base Command Manager User Manual to understand the type of content our knowledge base will be ingesting. Notice the technical content, formatting, and structure of the information - these are all elements that our RAG system will need to understand and reference correctly when answering questions.

PDF documents often present unique challenges: complex elements like tables, diagrams, equations, and multi-column layouts are often lost in the extraction process. Effective RAG systems require thoughtful preprocessing of PDF content - potentially using specialized extraction techniques for tables, OCR for image-based text, and custom parsing to preserve document structure.

In [None]:
import fitz
from IPython.display import display, HTML
import base64

def show_specific_pages(pdf_path, page_numbers):
    doc = fitz.open(pdf_path)

    html_content = '<div style="display: flex; flex-direction: column; gap: 20px;">'

    for page_num in page_numbers:
        # Adjust for 0-based indexing (subtract 1 from page number)
        i = page_num - 1

        # Skip if page number is out of range
        if i < 0 or i >= doc.page_count:
            continue

        page = doc.load_page(i)
        pix = page.get_pixmap()

        # Convert to PNG bytes
        png_bytes = pix.tobytes("png")

        # Encode the image to base64
        img_base64 = base64.b64encode(png_bytes).decode('utf-8')

        # Add page to HTML
        html_content += f'''
        <div style="border: 1px solid #ddd; padding: 10px;">
            <h3>Page {page_num}</h3>
            <img src="data:image/png;base64,{img_base64}" style="width: 50%;">
        </div>
        '''

    html_content += '</div>'
    return HTML(html_content)

# Show pages 9, 52, and 73
show_specific_pages("BCM_User_Manual.pdf", [9, 52, 73])

### Preprocessing the Data

To incorporate external data into our system, we'll first need to use a set of tools named **Document Loaders** to transform our content into structured `Document` objects, each containing both text content and its associated metadata. While we'll be using a PDF for our example today, there are Document Loaders for an ever-expanding variety of data types including plain text, Word, PowerPoint, CSV, HTML, images, and more.

For our RAG build, we'll be using `Unstructured` - an open-source toolkit designed to simplify the ingestion and pre-processing of diverse data formats with a focus on optimizing data for LLMs. It can handle the whole process, from creating the `Document` objects to the preprocessing we'll cover next.



---



#### Splitting and Chunks
Before we process our new data, we need to divide our `Document` objects into smaller, manageable segments - a practice known as splitting or chunking.

This preprocessing step offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of LLMs, and improving the quality of text representations used in retrieval systems.

For our splitting strategy today, we'll use a powerful approach called document-based splitting, built into the `Unstructured` toolkit. Instead of breaking content at fixed intervals, this technique follows the document's natural structure - using paragraphs, sections, and other meaningful boundaries as splitting points. The resulting **chunks** retain useful context, which improves retrieval quality.

If you're interested in learning more about splitting in general and the different splitting approaches, check out the deep dive section below. Otherwise, let's move on to implementing Unstructured and doing our splitting in the code block below.

> #### Deep Dive: Understanding Text Splitting

There are many ways of splitting data into chunks, but they usually fall into one of a few approaches:
- **Length**: The most straightforward and intuitive approach splits documents based on length, using either character count or token count as the measuring unit.
 - **Character-based**: Splitting on the number of characters in a text
 - **Token-based**: Splitting on the number of tokens - particularly useful for staying within LLM input token limits
- **Text Structure**: Text is naturally organized into hierarchical units such as paragraphs, sentences, and words. By leveraging this inherent structure, we can create chunks that preserve both natural language flow and semantic meaning across different levels of text granularity.
 - A `Recursive Character Text Splitter` exemplifies this approach: it tries to keep larger units like paragraphs intact, and when a unit exceeds the chunk size, it splits on the next smaller unit (such as sentences), and so forth.
- **Document Structure**: This approach leverages a document's natural organization, splitting content along its inherent structural boundaries - like sections, chapters, paragraphs, and tables - to create semantically meaningful chunks. By following the document's natural structure, this ensures smoother information flow and preserves topic boundaries - avoiding the mid-word splits and topic merging that can occur with simpler methods.
- **Semantic Meaning**: Semantic-based splitting focuses on actual *meaning* rather than arbitrary fixed rules, analyzing how content changes to determine natural breakpoints for dynamic chunks of varying sizes. Using a sliding window approach to compare consecutive groups of sentences, it measures how closely their ideas relate to each other - what we call semantic similarity. This creates cohesive chunks that naturally group related content together. While this approach definitely creates more meaningful chunks - it's slower, requires more computational resources, and is much more complex to implement than other methods.
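The difference between the length-based and text-structure approaches is easier to see side by side. The sketch below is a deliberately simplified illustration written in plain Python; `split_by_length` and `split_recursive` are hypothetical helper names, and unlike a production recursive splitter, this version drops the separators it splits on and does not merge small pieces back together.

```python
def split_by_length(text, chunk_size):
    """Length-based: cut every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def split_recursive(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Structure-based: try the largest separator first (paragraphs), and fall
    back to smaller ones (sentences, words) only for pieces still too long."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Nothing left to split on; fall back to a hard cut.
        return split_by_length(text, chunk_size)
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        if piece:
            chunks.extend(split_recursive(piece, chunk_size, tuple(rest)))
    return chunks

sample = "Slurm schedules jobs. It manages queues.\n\nBCM monitors nodes."

fixed = split_by_length(sample, 25)
structured = split_recursive(sample, 25)

print(fixed)       # chunks may end mid-sentence
print(structured)  # chunks follow paragraph and sentence boundaries
```

With a 25-character limit, the length-based splitter cuts the second sentence in half, while the recursive splitter returns one chunk per sentence.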

#### Using Unstructured

In [None]:
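from unstructured.partition.pdf import partition_pdf
from langchain_core.documents import Document

# Partition the PDF into elements and chunk them in one call.
# "basic" combines consecutive elements into chunks up to a size limit;
# Unstructured also offers "by_title", which additionally keeps section boundaries.
elements = partition_pdf("BCM_User_Manual.pdf", chunking_strategy="basic")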

# Convert Unstructured elements to LangChain Documents
documents = []
for element in elements:
    # Get text content
    content = element.text if hasattr(element, 'text') else str(element)

    # Get page number safely
    page_number = None
    if hasattr(element, 'metadata'):
        if hasattr(element.metadata, 'page_number'):
            page_number = element.metadata.page_number

    # Create document
    doc = Document(
        page_content=content,
        metadata={
            'source': "BCM_User_Manual.pdf",
            'page': page_number
        }
    )
    documents.append(doc)

print(f"Loaded {len(documents)} document elements.")

## Assembling the Retrieval Pipeline
Now that we've preprocessed our data chunks and built our knowledge base, it's time to focus on the core retrieval components of our RAG architecture.

### Generating Embeddings

Let's implement our first NIM - the embedding model. We'll use this model to vectorize our data chunks and generate vector embeddings. Embeddings are numerical representations of text that capture semantic meaning in a way computers can understand. They convert words and phrases into lists of numbers (vectors) - similar to how we can describe any location on Earth using latitude and longitude coordinates. Just as nearby places have similar coordinates, texts with similar meanings will have similar numerical values. This is crucial for RAG systems because it lets us efficiently find related content by calculating how close these coordinates are to each other.

In our implementation, we'll leverage the `nv-embedqa-e5-v5` NIM from NGC, an optimized embedding model for text-based question-answering retrieval. This model, a fine-tuned variant of `E5-Large-Unsupervised`, has been trained on a public dataset to enhance retrieval accuracy and efficiency.

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Create embeddings
embedding_model = "nvidia/nv-embedqa-e5-v5"
embedder = NVIDIAEmbeddings(model=embedding_model, truncate="END")

### Storing Embeddings in a Vector Database
Now we'll store our newly generated embeddings in a vector database.

A vector database is designed to store and manage vector embeddings - the numerical representations of data like text or images we spoke about earlier.

Unlike traditional databases that store and search data based on exact matches, vector databases enable similarity searches by comparing how close data points are in a high-dimensional space. To use the previous example of coordinates on a map, when we query the vector database - we're simply finding the closest neighbors to our location and returning them.
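As a rough mental model, a vector database can be pictured as a collection of (vector, payload) pairs with a "nearest neighbors" query. The sketch below is a toy under that assumption: `ToyVectorStore` is a hypothetical class that scans every stored vector, whereas production vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math

class ToyVectorStore:
    """A tiny in-memory store that answers nearest-neighbor queries by linear scan."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((vector, payload))

    def query(self, vector, k=2):
        # Rank stored items by Euclidean distance to the query vector.
        ranked = sorted(self._items, key=lambda item: math.dist(vector, item[0]))
        return [payload for _, payload in ranked[:k]]

store = ToyVectorStore()
store.add([0.0, 0.0], "chunk about Slurm")
store.add([0.1, 0.2], "chunk about PBS")
store.add([5.0, 5.0], "chunk about licensing")

# The query "location" sits next to the first two chunks and far from the third.
neighbors = store.query([0.0, 0.1], k=2)
print(neighbors)
```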

Both LangChain and NVIDIA provide support for a wide selection of vector stores. For this workflow, we've selected [**Chroma**](https://www.trychroma.com/) as our vector database. Chroma is a powerful, open-source vector database designed specifically for AI applications. While there are many other excellent vector database solutions available, we chose Chroma for its ease of use and portability.

Let’s set up Chroma and load our embeddings into the vector database.

In [None]:
from langchain_chroma import Chroma  # from the langchain-chroma package installed earlier
import time

# Create and persist vectorstore
start_time = time.time()
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedder,
    collection_name="bcm_docs",
    persist_directory="./chroma_db"
)

if vectorstore:
    print(f"Vector database was successfully created! Total embeddings indexed: {len(documents)}")
else:
    print("Failed to create the vector database. Please check your input data.")

print(f"--- {time.time() - start_time} seconds ---")

### Adding a Reranker

Now that our embeddings are stored in Chroma, we need a way to refine our search results for better accuracy.

To achieve this, we'll implement a Reranker model. A Reranker reorders the retrieved results based on their relevance to the query. While our vector database efficiently finds similar embeddings, a reranker applies a more sophisticated scoring model—often transformer-based—to ensure that the most relevant matches appear at the top. Think of it as a second pass! This extra step improves the overall quality of responses, making our system more precise and contextually aware.

We'll use the `nv-rerankqa-mistral-4b-v3` NIM from NGC. Based on the Mistral 4B architecture, this model is fine-tuned to score and reorder retrieved documents, ensuring that the most relevant information is prioritized. The `top_n` parameter sets how many document chunks the reranker returns. After the initial vector retrieval returns its candidates, the reranker scores them against the query and keeps the 10 most relevant chunks (in this case), ordered by relevance, before passing them to the LLM.

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIARerank

NV_rerank = NVIDIARerank(model='nvidia/nv-rerankqa-mistral-4b-v3', top_n=10)

### Launching the LLM

To complete our pipeline, let's connect our retrieval tools to the LLM that will give our chatbot life. Let's deploy an LLM NIM from the NVIDIA AI Foundation Models.

To streamline this process, we can use **ChatNVIDIA** - a LangChain chat model class from the `langchain_nvidia_ai_endpoints` package that simplifies connecting to and interacting with NVIDIA's NIM APIs. **ChatNVIDIA** makes it easy to send queries, process responses, and manage inference settings without needing to handle raw API calls manually.

Calling `ChatNVIDIA.get_available_models()` lists all the models you can use. Notice that there are models available from NVIDIA, Google, Meta, Microsoft, Mistral, and others! The list is updated constantly.


In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.output_parsers import StrOutputParser

available_models = ChatNVIDIA.get_available_models()

# Extracting and printing just the model names
model_names = [model.id for model in available_models]
print("\n".join(model_names))

For our purposes today, we're going to use Meta's `llama-3.1-8b-instruct` model. This is a great general-purpose LLM that's particularly popular for simple chatbots. It's a small LLM at only 8 billion parameters. But don't let that fool you - it's very powerful!

The model and its corresponding hyperparameters can be customized.

* **Temperature** controls the randomness of the model's responses. A lower temperature makes the output more focused and deterministic, while a higher temperature makes it more creative and diverse.
* **Top-p** (or nucleus sampling) determines the diversity of the generated text by considering only the smallest set of top predictions whose cumulative probability is at least p. A higher p lets the model pick from a broader range of possible next tokens, while a lower p restricts it to the most likely ones.
* **Max tokens** limits the length of the output by specifying the maximum number of tokens (words or word pieces) the model can generate in a single response. Since our free credits only cover 1,000 tokens, it's best to keep this really small!
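The first two settings can be sketched in a few lines of plain Python. This is an illustrative toy, not how the NIM implements sampling: the logits are made-up scores for four hypothetical candidate tokens, and the temperature must be greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}


# Made-up scores for four hypothetical candidate tokens
toy_logits = [2.0, 1.0, 0.5, 0.1]

sharp = softmax_with_temperature(toy_logits, 0.1)  # nearly deterministic
flat = softmax_with_temperature(toy_logits, 1.0)   # more spread out

print("T=0.1:", [round(x, 3) for x in sharp])
print("T=1.0:", [round(x, 3) for x in flat])
print("top_p=0.5 keeps token indices:", list(top_p_filter(flat, 0.5)))
print("top_p=0.9 keeps token indices:", list(top_p_filter(flat, 0.9)))
```

With a temperature of 0.1, almost all of the probability lands on the top token, which is why a low temperature suits a factual question-answering bot. Raising `top_p` widens the pool of tokens that the model may sample from.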

In [None]:
foundation_model = "meta/llama-3.1-8b-instruct"
llm = ChatNVIDIA(model=foundation_model, temperature=0.1, max_tokens=100, top_p=1.0) | StrOutputParser()

Now that we have our LLM up and running - let's ask it something very basic to see if it's working...

In [None]:
question = "What is the meaning of NVIDIA?"

answer = llm.invoke(question)
print(answer)

That was an easy one. Let's try asking it something from the knowledge base we created earlier with the NVIDIA Base Command Manager documentation.

In [None]:
question = "When using qstat to monitor the status of a job, what does a E R job state mean?"

answer = llm.invoke(question)
print(answer)

As expected, the foundation model can't answer this question, since the answer wasn't in its original training dataset and it can't access external data. It's also worth noting that this was a purposefully tricky question, which references data in both a table and the formatted code listed below it.

<img src="https://i.ibb.co/qMpSrmms/BCM-qstat.png" width="100%">

To get all this to work, we need to connect our retrieval components — including the embedder, retriever, reranker, and LLM — into a structured workflow.

This is where a **chain** comes in. In **LangChain**, a chain is a sequence of modular components that pass data through a defined pipeline. By chaining our retrieval steps together, we can finish the backend of our application.
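To see what piping components together means, here is a small sketch in plain Python. It is not LangChain's implementation: `Step` is a hypothetical class that only mimics the `|` composition, and every component is a stub.

```python
class Step:
    """Hypothetical stand-in for a chain component: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose left to right: the output of self becomes the input of other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stub components that mirror the shape of our pipeline
toy_retriever = Step(lambda q: {"question": q, "context": ["doc about qstat", "doc about GPUs"]})
toy_reranker = Step(lambda d: {**d, "context": [c for c in d["context"] if "qstat" in c]})
toy_prompt = Step(lambda d: f"Context: {'; '.join(d['context'])}\nQuestion: {d['question']}")
toy_llm = Step(lambda text: text.upper())  # pretend "model" that just upper-cases its prompt

toy_chain = toy_retriever | toy_reranker | toy_prompt | toy_llm
toy_result = toy_chain.invoke("what is qstat?")
print(toy_result)
```

Each step only needs to accept the previous step's output, so components can be swapped or inserted without touching the rest of the pipeline.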

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Re-create the LLM without an attached output parser; the chain below adds its own
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")

retriever = vectorstore.as_retriever(search_kwargs={'k':100})

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

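# Rerank the retrieved chunks against the question and keep only the top_n most relevant
reranker = lambda inputs: NV_rerank.compress_documents(query=inputs['question'], documents=inputs['context'])

chain = (
    # Fetch candidate chunks and pass the question through unchanged
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    # Replace the candidates with the reranked subset
    | {"context": reranker, "question": lambda inputs: inputs['question']}
    | prompt
    | llm
    | StrOutputParser()
)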

In [None]:
chain.invoke(question)

And now, with the retrieval pipeline in place, we have our correct answer! Finally, let's move to the last section, where we can create a nice UI that allows us to easily interact with the RAGBot.

## Setting up the Frontend

Let's build a professional GUI for our RAGBot using **Chainlit**, an open-source framework that enables us to create interactive visual interfaces through Python code.

### Theming
We'll start by setting up a configuration file to customize our application's appearance and branding. This involves creating a public directory for assets and generating a `config.toml` file with both dark and light theme settings, along with implementing our logo and favicon.

In [None]:
import os

# Create theming for UI

# Create the /public directory if it doesn't exist
os.makedirs("public", exist_ok=True)

# Create config.toml if it doesn't exist
if not os.path.exists("config.toml"):
    with open("config.toml", "w") as f:
        f.write("""\n
# ==============================================================================
######################### CHAINLIT CONFIGURATION FILE ##########################
# ==============================================================================

### GENERAL SETTINGS ===========================================================
### Alter general settings of the app ==========================================
### ============================================================================

[UI]
# App Name
name = "NVIDIA RAGBot"

# CSS File
# > Specify a CSS file that can be used to customize the user interface. The CSS
#   file can be served from the public directory or via an external link.
custom_css = "/public/stylesheet.css"

### DEFAULT THEME COLOURS ======================================================
### Change your app's default colour palette ===================================
# > Background colour: Change the colour of the app’s background.
# > Paper colour: Alters the colour of the ‘paper’ elements within
#   the app, such as the navbar, widgets, etc.
# > Primary colour: Encompasses three shades - main, dark, and light. These
#   colours are primarily used for interactive interface elements.
### ============================================================================

[UI.theme]
default = "dark"
layout = "wide"
font_family = "Inter, sans-serif"

# Modify Light Theme
[UI.theme.light]
background = "#FAFAFA"
paper = "#EBEEEF"

[UI.theme.light.primary]
main = "#76B900"
dark = "#76B900"
light = "#333333"

# Modify Dark Theme
[UI.theme.dark]
#background = "#FAFAFA"
#paper = "#FFFFFF"

[UI.theme.dark.primary]
main = "#76B900"
dark = "#1A1A1A"
light = "#333333"


""")
    print("config.toml created successfully!")

# Download the dark and light mode logos
!wget -O "public/logo_dark.png" -nc https://i.ibb.co/SPWqM8V/nvidia-logo-horizontal-white.png
!wget -O "public/logo_light.png" -nc https://i.ibb.co/3dzk4pH/nvidia-logo-horizontal-colour.png

# Download the favicon
!wget -O "public/favicon.ico" -nc https://nvidianews.nvidia.com/media/sites/219/images/favicon.ico

print("Logos and favicon downloaded successfully!")

### Chainlit
Now let's get the code to create a Chainlit-powered frontend connected to the backend that we've already put together. As much of this code is boilerplate and outside of the scope of this tutorial, we're going to download the full file to our system. But if you'd like to see the code itself, simply click on the `app.py` file in the directory after running this cell!

In [None]:
# Download frontend code from GitLab and save as app.py
!wget -nc -O app.py https://gitlab.com/adaveinthelife/external_blueprints/-/raw/main/app_files/Enterprise_RAG_BP_frontend_EXT.txt

### Tunneling and Deploying the Interface

This final step combines our frontend and backend into a complete Enterprise RAG application. Since we're working in a notebook and don't have a web server to host Chainlit, we'll create a temporary public URL using an open-source tool named `Tunnelmole` to make our application accessible. After running the cell below, you'll receive a unique link where you can interact with and demonstrate your application.

**Remember** - the application will remain active until you stop the cell or close the notebook.
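The launch cell gives Chainlit a fixed ten seconds to start before opening the tunnel. If your environment starts more slowly, one possible alternative is to poll the port until it accepts connections. The helper below is a hypothetical sketch using only the standard library; `wait_for_port` is not part of Chainlit or Tunnelmole.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, or False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, so try again shortly
    return False
```

Calling `wait_for_port(8000)` after starting Chainlit would return as soon as the server is listening, rather than always waiting the full ten seconds.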

In [None]:
import subprocess
import time
import re

print("🟢 Initializing system...")

# Start Chainlit in the background
chainlit_process = subprocess.Popen(
    ["chainlit", "run", "app.py", "--headless", "--port", "8000"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)

# Wait for Chainlit to initialize
time.sleep(10)

def start_tunnelmole():
    """Starts Tunnelmole and extracts only the public HTTPS URL."""
    tunnel_process = subprocess.Popen(
        ["tunnelmole", "8000"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True
    )

    public_url = None
    for line in tunnel_process.stdout:
        match = re.search(r"(https://[\w-]+\.tunnelmole\.net)", line)
        if match:
            public_url = match.group(1)
            break  # Stop once URL is found

    return public_url

# Start Tunnelmole
public_url = start_tunnelmole()

if public_url:
    print(f"\n🚀 Access your application here: {public_url}\n")
    print("\n🔹 This session will remain active until you manually stop it.")
    print("🔹 To stop the session, either:")
    print("   - Press the 'Stop' button in the top left corner of the cell.")
    print("   - Close the notebook.")
else:
    print("\n❌ Failed to start Tunnelmole.\n")


# Keep the session alive
while True:
    time.sleep(60)

## What's Next? 🚀
Thank you for taking the time to complete this course and workbook! We hope it has been informative and given you practical experience in building advanced AI solutions, especially in applying Retrieval-Augmented Generation (RAG) techniques to enhance chatbot capabilities. If you're interested in furthering your knowledge of NVIDIA AI technologies, we encourage you to explore additional resources:

*   [Visit NVIDIA's Foundational Models page](https://build.nvidia.com/explore/discover) to explore a growing library of cutting-edge AI models running in NVIDIA NIMs. You can demo each model from your browser, experiment with its capabilities, and easily integrate it into your projects.
*   [Visit the NVIDIA Developer Blog](https://developer.nvidia.com/blog) for the latest tutorials and articles on AI, data science, and GPU-accelerated computing.
*   [Check out the NVIDIA Deep Learning Institute](https://www.nvidia.com/en-us/training/) to dive deeper into topics like deep learning, model optimization, and more.
*   [Join the NVIDIA Developer Program](https://developer.nvidia.com/) to access exclusive tools, software development kits (SDKs), and forums where you can connect with experts in the AI field.

By staying connected with the NVIDIA ecosystem, you'll be able to keep up with cutting-edge advancements and continue honing your skills in AI.

Once again, congratulations on your achievement, and thank you for being part of this exciting journey into the world of Generative AI!

In [None]:
'''                   ======================================
                      ======================================
                      ======================================
               ========        =============================
           ==================      =========================
        =======       ===========      =====================
     =======      +====   ==========     ===================
  =======+     ========        =======     =================
========    ======+   ====       =======     ===============
=======   ======      ======    =======     ================
 =======   =====+     ======= ========    ==================
  =======   ======    ==============    ====================
   +======   +=====   ===========+    =========     ========
     ======    +==============     =========           =====
      =======     =====        ==========           ========
        +=======      ===============+          +===========
           =========  ===========          =================
              =========              =======================
                   +===     ================================
                      ======================================
                      ======================================
'''
print("Until next time!")