# **NVIDIA Riva RAG Lab: End-to-End Voice Chatbot**

## **1. Introduction**
Welcome to the **Riva RAG Ingestion Lab**. This notebook demonstrates how to build a complete **Voice-Enabled Retrieval Augmented Generation (RAG)** system from scratch. Unlike previous versions, this lab allows you to **ingest your own PDF documents**, create a vector index, and then query it using your voice.

## **2. Objective**
By the end of this lab, you will have built:
1.  A **Knowledge Base** created from raw PDF files.
2.  A **RAG Pipeline** that retrieves relevant context to answer questions.
3.  A **Voice Interface** that listens to your speech and responds with synthesized audio.

## **3. Technologies Used**
- **NVIDIA Riva**: For high-performance Automatic Speech Recognition (ASR) and Text-to-Speech (TTS).
- **LangChain**: For orchestrating the RAG pipeline.
- **FAISS**: For efficient vector storage and similarity search.
- **NVIDIA Embeddings (NV-EmbedQA-E5)**: For generating semantic vectors.
- **Meta Llama 3 (8B)**: For generating intelligent responses.


## **4. Architecture Overview**

![Riva RAG Architecture](docs/Riva_RAG_Architecture.png)

### **System Flow Explanation**
The diagram above illustrates the complete **Voice-Enabled RAG Pipeline**:

1.  **Ingestion (Left Side)**:
    - **PDF Documents**: Raw files are loaded from the `./pdf` directory.
    - **Preprocessing**: Text is extracted and split into chunks.
    - **Embedding**: The NVIDIA Embedding model converts text into vectors (using `input_type="passage"`).
    - **Vector Store**: A FAISS index stores these vectors locally.

2.  **Voice Interaction (Right Side)**:
    - **User Voice**: Your spoken question is captured.
    - **ASR (Automatic Speech Recognition)**: NVIDIA Riva transcribes the audio to text.
    - **RAG Retrieval**: The system queries the FAISS index (using `input_type="query"`) to find relevant context.
    - **LLM Generation**: The Meta Llama 3 model generates a concise answer based on the retrieved context.
    - **TTS (Text-to-Speech)**: NVIDIA Riva converts the text answer back into natural-sounding speech.
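
The only difference between the ingestion and retrieval embedding calls is the `input_type` field. The sketch below builds the two request bodies side by side, assuming the endpoint accepts an OpenAI-style `/v1/embeddings` payload with `input_type` as a top-level field. `embedding_request_body` is a hypothetical helper written for illustration, and the sample texts are made up.

```python
import json


def embedding_request_body(texts, input_type, model="nvidia/nv-embedqa-e5-v5"):
    """Build the JSON body for an OpenAI-style /v1/embeddings request (illustrative)."""
    if input_type not in ("passage", "query"):
        raise ValueError("input_type must be 'passage' or 'query'")
    return {"model": model, "input": list(texts), "input_type": input_type}


# Ingestion: document chunks are embedded as passages.
print(json.dumps(embedding_request_body(["Leave policy text ..."], "passage"), indent=2))

# Retrieval: the transcribed question is embedded as a query.
print(json.dumps(embedding_request_body(["How many leave days do I get?"], "query"), indent=2))
```

Keeping the two modes in separate client objects, as this lab does, makes it harder to embed a question with the passage setting by accident.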


### Scenario: InnovateSphere Corporation

We'll work with documents from InnovateSphere, a fictional company with three departments:
- **HR**: Employee policies, leave entitlements, code of conduct
- **Marketing**: Brand guidelines, campaigns, social media policies  
- **Sales**: Product catalogs, sales reports, team directories


## **Step 1: Environment Setup**

### **Dependency Resolution**
The cell below performs a "clean slate" installation:
- **Uninstalls** conflicting versions of LangChain and Pydantic.
- **Installs** a pinned, compatible set of libraries (LangChain 0.2.x, Pydantic v1) to ensure stability.

> **Note:** You may need to restart the kernel after running this cell.
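
After a kernel restart, it can help to confirm from Python which versions were actually picked up. The following is a minimal sketch using only the standard library. `installed_versions` and `PINNED_PACKAGES` are hypothetical names introduced here, not part of the lab code.

```python
from importlib import metadata

# Packages pinned by the install cell; extend this tuple as needed.
PINNED_PACKAGES = ("langchain", "langchain-core", "langchain-community", "pydantic")


def installed_versions(packages=PINNED_PACKAGES):
    """Return {package name: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If `pydantic` reports a 2.x version here, the pinned install probably did not take effect in the running kernel.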

In [None]:
# Dependency Resolution (Clean Slate)
# 1. Uninstall specific conflicting packages first
%pip uninstall -y langchain langchain-core langchain-community langchain-text-splitters langchain-openai pydantic -q

# 2. Install pinned compatible versions (LangChain 0.2.x, Pydantic v1)
%pip install --upgrade "pydantic<2.0.0" "langchain==0.2.14" "langchain-community==0.2.12" "langchain-core==0.2.33" "langchain-text-splitters==0.2.2" "langchain-openai==0.1.22" "nvidia-riva-client" "requests" "faiss-cpu" "httpx" "pypdf" -q

# 3. Verify versions
%pip list | grep -E "langchain|pydantic"

### **Import Libraries**
Import the Python modules needed for file I/O, networking, the Riva client, and the LangChain components.

In [None]:
import os
import wave
import json
import requests
import riva.client
import IPython.display as ipd
from IPython.display import display
import httpx

# LangChain Imports
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

### **Log Suppression**
To keep the notebook output clean, we suppress verbose logging from the HTTP client and FAISS.

In [None]:
# Suppress Logging
import logging
import os

# Suppress HTTPX logs (POST requests)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)

# Suppress FAISS loader messages (logged at INFO level when faiss is first imported)
logging.getLogger("faiss").setLevel(logging.WARNING)

print("Verbose logging suppressed.")

## **Step 2: Configuration**

Here we define the API endpoints and Authentication Tokens.

> **IMPORTANT:** You must replace `PLACEHOLDER_EMBEDDING_ENDPOINT` and `PLACEHOLDER_LLM_ENDPOINT` with your actual endpoints, and `PLACEHOLDER_EMBEDDING_TOKEN` and `PLACEHOLDER_LLM_TOKEN` with your actual keys.

### **How to Generate API Tokens**

To run this lab, you need endpoints and tokens for the **NVIDIA Embedding** and **Llama 3** models. Follow these steps:

**1. Access Gen AI Studio**
Open the Generative AI Studio from your dashboard.
![Access Gen AI](docs/accessing_Gen%20AI.png)

**2. Navigate to Model Endpoints**
Click on the "Model Endpoints" option in the side menu.
![Model Endpoints](docs/Model_endpoints_option.png)

**3. Select a Model**
Locate and select the model you need (e.g., `meta/llama3-8b-instruct` or `nvidia/nv-embedqa-e5-v5`).
![Select Model](docs/accessing%20the%20model%20endpoint.png)

**4. Copy the Endpoint URL**
Copy the "Endpoint URL" shown on the screen. Paste this into the `EMBED_ENDPOINT` or `LLM_ENDPOINT` variable below.
![Copy Endpoint](docs/copy_endpoint.png)

**5. Generate Token**
Click the **"Generate Token"** button.
![Generate Button](docs/Generate_token_button.png)

**6. Copy the API Key**
A token will be generated. Copy this key immediately.
![Copy Key](docs/copy_key.png)

> **Repeat** this process for both the **Embedding Model** and the **LLM**.
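
If you would rather not paste tokens into the notebook, one option is to read them from environment variables and fall back to the placeholders. This is a sketch under assumptions: the environment variable names below are hypothetical (chosen to match the notebook's variable names), and `read_setting` and `missing_settings` are illustrative helpers.

```python
import os


def read_setting(env_name, placeholder):
    """Return the environment variable's value, or the placeholder if it is unset or empty."""
    value = os.environ.get(env_name, "").strip()
    return value or placeholder


def missing_settings(settings):
    """Return the names of settings that still hold a PLACEHOLDER_ value."""
    return [name for name, value in settings.items() if value.startswith("PLACEHOLDER_")]


# Hypothetical environment variable names; adjust to your own setup.
settings = {
    "EMBED_ENDPOINT": read_setting("EMBED_ENDPOINT", "PLACEHOLDER_EMBEDDING_ENDPOINT"),
    "LLM_ENDPOINT": read_setting("LLM_ENDPOINT", "PLACEHOLDER_LLM_ENDPOINT"),
    "EMBEDDING_AUTH_TOKEN": read_setting("EMBEDDING_AUTH_TOKEN", "PLACEHOLDER_EMBEDDING_TOKEN"),
    "LLM_AUTH_TOKEN": read_setting("LLM_AUTH_TOKEN", "PLACEHOLDER_LLM_TOKEN"),
}

print("Still unset:", missing_settings(settings) or "none")
```

Checking for leftover placeholders up front gives a clearer message than the authentication error you would otherwise see at the first API call.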


In [None]:
# Configuration

# API Endpoints
EMBED_ENDPOINT = "PLACEHOLDER_EMBEDDING_ENDPOINT"
LLM_ENDPOINT = "PLACEHOLDER_LLM_ENDPOINT"

# Authentication Tokens (Update these!)
EMBEDDING_AUTH_TOKEN = "PLACEHOLDER_EMBEDDING_TOKEN"
LLM_AUTH_TOKEN = "PLACEHOLDER_LLM_TOKEN"

# Riva URI
RIVA_URI = "10.179.253.43:32222"

# Directories
OUTPUT_DIR = "./outputs"
PDF_DIR = "./pdf"
EMBEDDING_DIR = "./embeddings"

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(PDF_DIR, exist_ok=True)
os.makedirs(EMBEDDING_DIR, exist_ok=True)

## **Step 3: Initialize Embeddings and LLM**

The cell below creates a shared HTTP client, two embedding clients (one with `input_type="passage"` for ingestion, one with `input_type="query"` for retrieval), and the Llama 3 chat model.

In [None]:
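# HTTP Client
# NOTE: verify=False disables TLS certificate verification. Use it only for lab
# endpoints with self-signed certificates, never in production.
http_client = httpx.Client(verify=False)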

# 1. Ingestion Embeddings (Passage)
ingestion_embeddings = OpenAIEmbeddings(
    base_url=f"{EMBED_ENDPOINT}/v1",
    api_key=EMBEDDING_AUTH_TOKEN,
    model="nvidia/nv-embedqa-e5-v5",
    check_embedding_ctx_length=False,
    http_client=http_client,
    model_kwargs={"extra_body": {"input_type": "passage"}}
)

# 2. Retrieval Embeddings (Query)
retrieval_embeddings = OpenAIEmbeddings(
    base_url=f"{EMBED_ENDPOINT}/v1",
    api_key=EMBEDDING_AUTH_TOKEN,
    model="nvidia/nv-embedqa-e5-v5",
    check_embedding_ctx_length=False,
    http_client=http_client,
    model_kwargs={"extra_body": {"input_type": "query"}}
)

# 3. LLM
llm = ChatOpenAI(
    base_url=f"{LLM_ENDPOINT}/v1",
    api_key=LLM_AUTH_TOKEN,
    model="meta/llama3-8b-instruct",
    temperature=0.1,
    max_tokens=1024,
    http_client=http_client
)

print("Components initialized successfully.")

## **Step 4: Data Ingestion (Knowledge Base Creation)**

In this step, we build the "Brain" of our chatbot:
1.  **Load**: Read all PDF files from the `./pdf` directory.
2.  **Split**: Break the text into smaller chunks (800 characters) to fit into the models' context window.
3.  **Embed & Index**: Convert text chunks into vectors and save them locally using FAISS.

> **Action**: Ensure you have uploaded at least one PDF to the `pdf` folder!
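
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is not what `RecursiveCharacterTextSplitter` does internally (that class prefers to break on separators such as paragraphs and sentences). It only illustrates how an 800-character window with a 60-character overlap moves across a text. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=60):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "x" * 2000
print([len(chunk) for chunk in chunk_text(sample)])  # [800, 800, 520]
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, as long as it is shorter than the overlap.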

In [None]:
print(f"Loading PDFs from {PDF_DIR}...")
loader = PyPDFDirectoryLoader(PDF_DIR)
data = loader.load()

if not data:
    print("WARNING: No PDF documents found! Please upload files to the ./pdf folder.")
else:
    print(f"Loaded {len(data)} pages.")
    
    # Split Text
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=60,
    )
    documents = text_splitter.split_documents(data)
    print(f"Created {len(documents)} text chunks.")
    
    # Create Index (using ingestion_embeddings)
    print("Creating FAISS index (this may take a moment)...")
    docsearch = FAISS.from_documents(documents, embedding=ingestion_embeddings)
    
    # Save Index
    docsearch.save_local(folder_path=EMBEDDING_DIR)
    print(f"Index saved to {EMBEDDING_DIR}")

## **Step 5: RAG Pipeline Setup**

Now that we have our index, we set up the retrieval chain:
1.  **Load Index**: Read the FAISS index from disk using the `retrieval_embeddings`.
2.  **Prompt**: Define how the LLM should behave. We enforce concise answers because short responses work better for voice interaction.
3.  **Chain**: Combine the Retriever + Prompt + LLM into a single callable object.
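
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The toy example below shows that idea with plain Python lists and Euclidean distance. It assumes the LangChain FAISS wrapper's defaults (L2 distance, 4 results), which you should verify for your installed version. The vectors and texts are invented for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed, k=4):
    """Return the k (text, distance) pairs closest to query_vector, nearest first."""
    scored = [(text, l2_distance(query_vector, vector)) for text, vector in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = [
    ("Annual leave policy", [0.9, 0.1, 0.0]),
    ("Brand colour guidelines", [0.1, 0.9, 0.0]),
    ("Quarterly sales report", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.1]

for text, distance in top_k(query, toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

FAISS performs the same nearest-neighbour search, but with data structures optimized for large collections of high-dimensional vectors.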

In [None]:
# Load Index (using retrieval_embeddings)
try:
    vector_store = FAISS.load_local(
        EMBEDDING_DIR, 
        retrieval_embeddings, 
        allow_dangerous_deserialization=True
    )
    retriever = vector_store.as_retriever()
    print("FAISS index loaded for retrieval.")

    # Prompt Template
    prompt = ChatPromptTemplate.from_template("""
    You are a voice assistant. Answer strictly based on the provided context.
    
    IMPORTANT GUIDELINES:
    1. Be extremely concise. Keep your answer under 2 sentences.
    2. Do not use bullet points or special formatting.
    3. If the answer is not in the context, say "I don't know."
    
    <context>
    {context}
    </context>
    
    Query: {input}
    
    Answer in {language}:
    """)

    # Create Chain
    document_chain = create_stuff_documents_chain(llm, prompt)
    retrieval_chain = create_retrieval_chain(retriever, document_chain)
    print("RAG Chain ready.")
    
    
except Exception as e:
    print(f"Error loading index: {e}")
    print("Did you run the ingestion step above?")

## **Step 6: Initialize NVIDIA Riva**

We set up the **Riva Client** to access the high-performance Speech AI services running on the cluster.
- **ASRService**: For transcribing your microphone input.
- **SpeechSynthesisService**: For speaking the AI's response.
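
Creating the Riva client objects does not necessarily contact the server, so a wrong `RIVA_URI` may only surface at the first ASR or TTS call. A quick TCP check can catch that earlier. This sketch uses only the standard library, and `split_host_port` and `is_reachable` are hypothetical helpers. An open port does not prove that the Riva services behind it are healthy.

```python
import socket


def split_host_port(uri):
    """Split 'host:port' into (host, port as int)."""
    host, _, port = uri.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {uri!r}")
    return host, int(port)


def is_reachable(uri, timeout=3.0):
    """Return True if a TCP connection to uri can be opened within timeout seconds."""
    host, port = split_host_port(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (uses the RIVA_URI defined in the configuration cell):
# print(is_reachable(RIVA_URI))
```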

In [None]:
try:
    auth = riva.client.Auth(uri=RIVA_URI)
    asr_service = riva.client.ASRService(auth)
    tts_service = riva.client.SpeechSynthesisService(auth)
    print("✓ Riva services initialized.")
except Exception as e:
    print(f"✗ Error initializing Riva: {e}")

## **Step 7: Define ASR Function**

This function transcribes audio files using the **Parakeet multilingual ASR model** with automatic language detection.

**Key Features:**
- Supports multiple languages (English, Spanish, French, German, Chinese, and more)
- Automatically detects the spoken language
- Enables automatic punctuation for better readability
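
The ASR function below relies on the WAV header for the sample rate and channel count. If a transcription comes back empty, inspecting the file first can rule out a malformed recording. The sketch uses Python's built-in `wave` module. `describe_wav` is a hypothetical helper, not part of the Riva client.

```python
import wave


def describe_wav(path):
    """Read sample rate, channel count, sample width, and duration from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_s": frames / rate if rate else 0.0,
        }


# Example: print(describe_wav("./outputs/response_1.wav"))
```

A duration of zero or an unexpected channel count is usually a sign that the recording step failed, not the recognizer.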

In [None]:
def transcribe_audio(audio_file):
    try:
        with open(audio_file, 'rb') as fh:
            data = fh.read()
        config = riva.client.RecognitionConfig(
            encoding=riva.client.AudioEncoding.LINEAR_PCM,
            max_alternatives=1,
            enable_automatic_punctuation=True,
            verbatim_transcripts=False,
            language_code="multi"
        )
        riva.client.add_audio_file_specs_to_config(config, audio_file)
        response = asr_service.offline_recognize(data, config)
        if len(response.results) > 0:
            return response.results[0].alternatives[0].transcript
        return ""
    except Exception as e:
        print(f"ASR Error: {e}")
        return ""

## **Step 7b: Define RAG Query Function**

This function sends the transcribed text to the **retrieval_chain**.

**How it works:**
1. Receives the user's transcribed query.
2. Passes the query and the desired output language to `retrieval_chain.invoke`.
3. The chain retrieves relevant context from the vector database.
4. The LLM generates a response in the requested language.
5. Returns the `answer` field of the chain's response.
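
The sketch below imitates, with plain string formatting, how the three prompt placeholders (`context`, `input`, `language`) are filled before the text reaches the LLM. It is a simplification: the real chain does this through LangChain, and the template here is shortened. `fill_prompt` is a hypothetical helper and the sample context sentence is made up.

```python
PROMPT_TEMPLATE = (
    "You are a voice assistant. Answer strictly based on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Query: {input}\n"
    "Answer in {language}:"
)


def fill_prompt(context_chunks, user_query, language="English"):
    """Join the retrieved chunks and substitute all three placeholders."""
    context = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, input=user_query, language=language)


print(fill_prompt(
    ["Employees receive 20 days of annual leave."],  # invented example chunk
    "How much leave do I get?",
    language="Spanish",
))
```

Because `language` is an ordinary prompt variable, it must be supplied on every call to the chain, which is why `query_rag` gives it a default value.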

In [None]:
def query_rag(text, output_language="English"):
    try:
        response = retrieval_chain.invoke({"input": text, "language": output_language})
        return response["answer"]
    except Exception as e:
        print(f"RAG Error: {e}")
        return "Error retrieving answer."


## **Step 8: Define TTS Function**

This function converts text responses to speech using the **Magpie multilingual TTS model**.

**Key Features:**
- Supports multiple languages with natural-sounding voices
- Saves audio files to `./outputs/` directory
- Auto-plays the generated audio in the notebook
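
Note that the RAG function takes a language name (for the prompt), while the TTS function takes a language code. A small lookup table can keep the two in sync. The mapping below is an assumption for illustration only: check which language codes your deployed TTS model actually supports. `to_language_code` is a hypothetical helper.

```python
# Illustrative mapping; the codes supported by your TTS deployment may differ.
LANGUAGE_CODES = {
    "English": "en-US",
    "Spanish": "es-US",
    "French": "fr-FR",
    "German": "de-DE",
}


def to_language_code(language_name, default="en-US"):
    """Map a language name such as 'Spanish' to a language code, with a fallback."""
    return LANGUAGE_CODES.get(language_name.strip().title(), default)


print(to_language_code("spanish"))  # es-US
print(to_language_code("Klingon"))  # en-US (fallback)
```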

In [None]:
tts_counter = 0
def speak_text(text, language_code="en-US"):
    global tts_counter
    try:
        print(f"Synthesizing: {text[:50]}...")
        response = tts_service.synthesize(
            text, language_code=language_code, sample_rate_hz=44100
        )
        tts_counter += 1
        output_file = os.path.join(OUTPUT_DIR, f"response_{tts_counter}.wav")
        with wave.open(output_file, 'wb') as out_f:
            out_f.setnchannels(1)
            out_f.setsampwidth(2)
            out_f.setframerate(44100)
            out_f.writeframesraw(response.audio)
        display(ipd.Audio(output_file, autoplay=True))
    except Exception as e:
        print(f"TTS Error: {e}")

## **Step 9: Launch Chatbot**

Finally, we launch the interactive UI.
- **Record**: Click to speak your question.
- **Process**: The system transcribes audio -> Queries RAG -> Generates Answer -> Synthesizes Speech.
- **Listen**: The answer will be played back automatically.

In [None]:
from chatbot_ui import create_chatbot_ui
create_chatbot_ui(transcribe_audio, query_rag, speak_text)
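
The `chatbot_ui` module is not shown in this notebook, so its internals are unknown here. As a rough mental model, one voice turn presumably chains the three functions as in the sketch below. `handle_turn` is a hypothetical function, and the lambdas are stand-ins so the example runs without Riva or the LLM.

```python
def handle_turn(audio_file, transcribe, query, speak,
                output_language="English", language_code="en-US"):
    """Run one voice turn: audio file -> transcript -> answer -> speech."""
    transcript = transcribe(audio_file)
    if not transcript:
        return None, None  # nothing was recognized, so skip the RAG and TTS steps
    answer = query(transcript, output_language)
    speak(answer, language_code)
    return transcript, answer


# Stand-in callables in place of transcribe_audio, query_rag, and speak_text.
spoken = []
transcript, answer = handle_turn(
    "question.wav",
    transcribe=lambda path: "How many leave days do I get?",
    query=lambda text, language: f"(answer to: {text})",
    speak=lambda text, code: spoken.append((text, code)),
)
print(transcript, "->", answer)
```

In the notebook itself you would pass `transcribe_audio`, `query_rag`, and `speak_text` in place of the lambdas.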