## Install Dependencies
We will mainly be working with HuggingFace, NVIDIA, and LlamaIndex. \
Let's get started by installing the Python dependencies needed for this demo.


In [None]:
%pip install gradio
%pip install llama-index-llms-huggingface
%pip install llama-index-llms-huggingface-api
%pip install llama-index-embeddings-huggingface
%pip install llama-index-embeddings-huggingface-api
%pip install --upgrade --quiet llama-index-llms-nvidia llama-index-embeddings-nvidia llama-index-readers-file llama-index llama-index-readers-web

In [None]:
!pip install "transformers[torch]" "huggingface_hub[inference]"

In [None]:
!pip install llama-index

In [None]:
import os

# Placeholders only: the real keys are collected interactively in the next section.
os.environ["HF_TOKEN"] = ""
os.environ["NVIDIA_API_KEY"] = ""

## Prepare Environment
Let's set up the environment we need for interacting with the various APIs. \
We will need the following API keys: \
**HuggingFace Access Token** - to download the gated Llama-3.1-8B-Instruct model (apply for access [here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)) \
**NGC Key** - to download NIM containers to run on-prem (stored below as `NVIDIA_API_KEY`, starting with `nvapi-`)

In [1]:
hf_token = ""

import getpass
import os

# del os.environ['HF_TOKEN']  ## delete key and reset
if os.environ.get("HF_TOKEN", "").startswith("hf_"):
    print("Valid HF_TOKEN already in environment. Delete to reset")
else:
    hf_token = getpass.getpass("HF TOKEN (starts with hf_): ")
    assert hf_token.startswith(
        "hf_"
    ), f"{hf_token[:5]}... is not a valid key"
    os.environ["HF_TOKEN"] = hf_token

HF TOKEN (starts with hf_):  ········


In [2]:
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith(
        "nvapi-"
    ), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

NVAPI Key (starts with nvapi-):  ········
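
The two cells above follow the same pattern: reuse a key that is already in the environment, otherwise prompt for it and check its prefix. A minimal sketch of that pattern as one reusable function is shown below. The name `ensure_key` and its `prompt` parameter are hypothetical and not part of the original notebook.

```python
import getpass
import os


def ensure_key(env_name, prefix, prompt=getpass.getpass):
    """Return the key stored in env_name, prompting for it if missing or malformed."""
    current = os.environ.get(env_name, "")
    if current.startswith(prefix):
        # A plausible key is already set; keep it.
        return current
    key = prompt(f"{env_name} (starts with {prefix}): ")
    if not key.startswith(prefix):
        # Avoid echoing any part of the secret in the error message.
        raise ValueError(f"{env_name} must start with {prefix!r}")
    os.environ[env_name] = key
    return key
```

With this helper, the two cells would reduce to `ensure_key("HF_TOKEN", "hf_")` and `ensure_key("NVIDIA_API_KEY", "nvapi-")`. Raising `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.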


## Inferencing with On-Prem NVIDIA Inference Microservices (NIM)
In this section, the code cells connect to a locally deployed Llama-3.1-8B-Instruct NIM container and use it for inferencing.

In [3]:
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio

nest_asyncio.apply()

### Sample inferencing with On-Prem NIM model
In the following code cell, we first specify the NIM endpoint serving the Llama-3.1-8B-Instruct model and then run a sample inference.

In [4]:
from llama_index.llms.nvidia import NVIDIA
from llama_index.core.llms import ChatMessage

# connect to a chat NIM running at localhost:8000 and select the model to use
local_nim_llm = NVIDIA(
    base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct"
)

# content = ""
# for completion in local_nim_llm.stream_complete("What is Dell Technologies?"):
#     content += completion.delta
#     print(completion.delta, end="")

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is Dell Technologies?"),
]

content = ""
for completion in local_nim_llm.stream_chat(messages):
    content += completion.delta
    print(completion.delta, end="")

Dell Technologies is a multinational computer technology company that designs, develops, manufactures, markets, and supports computers and other electronic devices. The company was founded in 1984 by Michael Dell and is headquartered in Round Rock, Texas, USA.

Dell Technologies is a leading provider of a wide range of products and services, including:

1. **Personal computers**: Desktops, laptops, tablets, and mobile devices.
2. **Servers**: Data center servers, storage systems, and networking equipment.
3. **Storage**: Storage solutions, including hard disk drives, solid-state drives, and storage arrays.
4. **Networking**: Networking equipment, including switches, routers, and wireless access points.
5. **Virtualization**: Virtualization software and services, including VMware.
6. **Cloud computing**: Cloud infrastructure, platform, and software as a service (IaaS, PaaS, SaaS).
7. **Cybersecurity**: Cybersecurity solutions, including threat detection, incident response, and security 
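
Before running inference, it can help to confirm that the local endpoint is actually reachable. The sketch below assumes the NIM exposes the OpenAI-style `GET /v1/models` route under the same `base_url` used above. The function name `list_served_models` is hypothetical, and the response shape (`{"data": [{"id": ...}]}`) is an assumption based on that convention.

```python
import json
import urllib.request


def list_served_models(base_url, timeout=5.0):
    """Return model ids reported by an OpenAI-compatible endpoint, or [] if unreachable."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return []
    if not isinstance(payload, dict):
        return []
    return [m.get("id") for m in payload.get("data", []) if isinstance(m, dict)]
```

Calling `list_served_models("http://localhost:8000/v1")` should, under these assumptions, include `meta/llama-3.1-8b-instruct` when the container is up, and return an empty list when nothing is listening.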

## Retrieval Augmented Generation (RAG) with On-Prem NIM
In this section, we build RAG on top of the on-prem NIM and present it in an interactive chatbot where users can: \
1. scrape websites for additional content
2. upload files for additional content
3. choose whether or not to use the knowledge base

### Writing the logic for RAG
In the following code cell, we will develop a custom LlamaIndex Query Engine class which does the following:
- Sets the deployed NIM endpoint for LLM inferencing
- Loads and sets a local embedding model from HuggingFace
- Instantiates an in-memory Vector Database
- Instantiates and connects the LLM, Embedding Model, and Vector Databases together into a Query Engine

In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.readers.web import SimpleWebPageReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.nvidia import NVIDIA
from llama_index.core.llms import ChatMessage  # used by query_without_rag
from pathlib import Path
import os

os.environ['GRADIO_TEMP_DIR'] = "/home/jovyan/work/gradio"

RAG_UPLOAD_FOLDER = "/home/jovyan/work/rag-documents/"


class Custom_Query_Engine():
    def __init__(self):
        self.SYSTEM_PROMPT = """You are an AI assistant that answers questions in a friendly manner. Here are some rules you always follow:
        - Generate human readable output, avoid creating output with gibberish text.
        - Make use of the additional context given to provide better answers.
        - Elaborate on your responses based on the context given.
        - Give as much detail as you can to help the user with the query.
        """
        
        self.llm = NVIDIA(
            base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct"
        )

        # self.embed_model = IpexLLMEmbedding(model_name="/llm-models/hf-models/bge-small-en-v1.5", trust_remote_code=True)
        self.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
            
        Settings.llm = self.llm
        Settings.embed_model = self.embed_model
        
        Path(RAG_UPLOAD_FOLDER).mkdir(parents=True, exist_ok=True)
        self.use_rag = False

    def toggle_rag(self, toggle):
        self.use_rag = toggle
        return self.use_rag

    def get_rag_toggle(self):
        return self.use_rag

    def query(self, message):
        return self.query_engine.query(message)

    def query_without_rag(self, message):
        messages = [
            ChatMessage(role="system", content="You are a helpful assistant."),
            ChatMessage(role="user", content=message),
        ]
        return self.llm.stream_chat(messages)

    def reload_scraped(self, documents):

        # Drop any previous query engine and index before rebuilding them.
        try:
            del self.query_engine
        except AttributeError:
            print("instantiating new query engine")
        else:
            print("re-creating query engine")

        try:
            del self.index
        except AttributeError:
            print("instantiating new index")
        else:
            print("re-creating index")
            
        self.index = VectorStoreIndex.from_documents(documents, show_progress=True)
        self.query_engine = self.index.as_query_engine(streaming=True, similarity_top_k=6)

    def reload_uploaded(self, path):

        # Drop any previous query engine and index before rebuilding them.
        try:
            del self.query_engine
        except AttributeError:
            print("instantiating new query engine")
        else:
            print("re-creating query engine")

        try:
            del self.index
        except AttributeError:
            print("instantiating new index")
        else:
            print("re-creating index")
            
        # Read from the folder passed in by the caller rather than ignoring the argument.
        self.documents = SimpleDirectoryReader(path).load_data()
        self.index = VectorStoreIndex.from_documents(self.documents, show_progress=True)
        self.query_engine = self.index.as_query_engine(streaming=True, similarity_top_k=6)
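
To make the roles of the index and `similarity_top_k` more concrete, here is a toy, standard-library-only sketch of in-memory vector retrieval. It is not how LlamaIndex or the `bge-small-en-v1.5` model work internally: the bag-of-words `embed` function and the `ToyVectorIndex` class are hypothetical stand-ins chosen only to illustrate "embed, score by cosine similarity, keep the top k".

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding': token -> count. A real model returns dense vectors."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Keeps (chunk, vector) pairs in memory and ranks them against a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


index = ToyVectorIndex([
    "NIM containers serve models behind an HTTP endpoint",
    "Gradio builds web interfaces for Python functions",
    "Embeddings map text to vectors for similarity search",
])
top = index.retrieve("how do embeddings support similarity search", top_k=1)
print(top[0])
```

In the class above, the retrieved chunks play the role of the extra context that the query engine passes to the LLM along with the user's question; `similarity_top_k=6` simply means six chunks are kept instead of one.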

### Writing additional helper functions for Gradio RAG Chatbot
Next, we will write some additional helper functions that take in information passed from our Gradio frontend, process this information, and feed it into our query engine. These functions include:
- Toggling of the knowledge base
- Scraping and vectorizing content from custom URLs
- Vectorizing files uploaded by the user

In [6]:
import gradio as gr
import shutil
import glob
import os

def stream_response(message, history):
    print(f"current RAG toggle is {query_engine.get_rag_toggle()}")
    if query_engine.get_rag_toggle():
        print('using RAG')
        response = query_engine.query(message)
        print(response.source_nodes[0].get_content())
        res = ""
        for token in response.response_gen:
            # print(token, end="")
            res = str(res) + str(token)
            yield res
    else:
        print('not using RAG')
        response = query_engine.query_without_rag(message)
        res = ""
        for token in response:
            # print(token, end="")
            res = str(res) + str(token.delta)
            yield res

def vectorize_scrape(url, progress=gr.Progress()):
    Path(RAG_UPLOAD_FOLDER).mkdir(parents=True, exist_ok=True)
    UPLOAD_FOLDER = RAG_UPLOAD_FOLDER

    prev_files = glob.glob(f"{UPLOAD_FOLDER}*")
    for f in prev_files:
        os.remove(f)

    if not url:
        return []
    
    documents = SimpleWebPageReader(html_to_text=True).load_data([url])


    query_engine.reload_scraped(documents)
    
    return url

def vectorize_uploads(files, progress=gr.Progress()):
    Path(RAG_UPLOAD_FOLDER).mkdir(parents=True, exist_ok=True)
    UPLOAD_FOLDER = RAG_UPLOAD_FOLDER

    prev_files = glob.glob(f"{UPLOAD_FOLDER}*")
    for f in prev_files:
        os.remove(f)

    if not files:
        return []
    
    file_paths = [file.name for file in files]

    for file in files:
        shutil.copy(file.name, UPLOAD_FOLDER)

    query_engine.reload_uploaded(UPLOAD_FOLDER)
    
    return file_paths

def toggle_knowledge_base(use_rag):
    print(f"toggling use knowledge base to {use_rag}")
    query_engine.toggle_rag(use_rag)
    return
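
Both branches of `stream_response` yield the full text so far rather than the individual token, because a generator passed to `gr.ChatInterface` replaces the displayed message with each yielded value. A minimal sketch of that accumulation pattern, independent of Gradio and the LLM, is shown below. The helper name `accumulate` is hypothetical.

```python
def accumulate(tokens):
    """Yield the growing message after each token, as a streaming chat callback would."""
    text = ""
    for token in tokens:
        text += str(token)
        yield text


for partial in accumulate(["Dell ", "Technologies ", "is..."]):
    print(partial)
```

Yielding only the latest token instead would make the chat window show one fragment at a time rather than a steadily growing answer.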

### Creating a Gradio UI for interacting with On-Prem NIM
We will now create a Gradio frontend to consume the deployed On-Prem NIM model. \
Gradio assigns ports incrementally starting from port `7860`, skipping any that are already in use. \
In the run shown below, this demo is served on port `7860`.
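
The incremental port search can be pictured with a small sketch. This is not Gradio's actual code, only an illustration of "try `7860`, then the next port, until one is free". The function name `first_free_port` and its parameters are hypothetical.

```python
import socket


def first_free_port(start=7860, attempts=100, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound on host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is taken (or not permitted); try the next one.
                continue
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


print(first_free_port())
```

If another Gradio app is already running on the same machine, a check like this would typically report the next port up, which is why the port shown in the launch output is the one to use.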

In [7]:
import gradio as gr

query_engine = Custom_Query_Engine()

css = """
.app-interface {
    height:80vh;
}
.chat-interface {
    height: 75vh;
}
.file-interface {
    height: 40vh;
}
"""
with gr.Blocks(css=css) as demo:
    gr.Markdown(
    """
    <h1 style="text-align: center;">NIMs Document Chatbot 💻📑✨</h1>
    """)
    with gr.Row(equal_height=False, elem_classes=["app-interface"]):
        with gr.Column(scale=4, elem_classes=["chat-interface"]):
            test = gr.ChatInterface(fn=stream_response)
        with gr.Column(scale=1):
            url_input = gr.Textbox(label="Reference File URL", lines=1)
            scrape_button = gr.Button("Scrape Site")
            scrape_button.click(fn=vectorize_scrape, inputs=url_input, outputs=url_input)
            # file_input = gr.File(elem_classes=["file-interface"], file_types=["pdf", "csv", "text", "html"], file_count="multiple")
            file_input = gr.File(elem_classes=["file-interface"], file_types=["file"], file_count="multiple")
            vectorize_button = gr.Button("Vectorize Files")
            vectorize_button.click(fn=vectorize_uploads, inputs=file_input, outputs=file_input)
            use_rag = gr.Checkbox(label="Use Knowledge Base")
            use_rag.select(fn=toggle_knowledge_base, inputs=use_rag)
            

demo.launch(server_name="0.0.0.0", ssl_verify=False, inline=False)



* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.




current RAG toggle is False
not using RAG
instantiating new query engine
instantiating new index


Parsing nodes:   0%|          | 0/22 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]

toggling use knowledge base to True
current RAG toggle is True
using RAG
Validation findings  
 
19 Generative AI in the Enterprise – Model Training  
A Scalable and Modular Production Infrastructure with NVIDIA for AI Large Language Model Training  
Technical White Paper  
   
Model training or pretraining yields a  foundational LLM by training it on a large corpus of 
data. We vali dated our design to ensure the functionality of model training techniqu e 
available in the NeMo framework. Our goal in this validation was not to train a model to 
convergence  and generate a complete foundational model , but rather to train  for a defined 
numb er of steps in order to achieve the goals described here . 
The following list provides the details of our validation setup:  
• Model architectures
 We trained primarily with 7B and 70B Llama 2 model  
architectures . We also trained with  GPT model  architectures . 
• Foundation model pre-training using NeMo Framework
 See the NeMo 
documentatio

**NOTE: To access the On-Prem NIM RAG Demo, please don't use the URL printed above; use this instead: `http://<YOUR-VM-IP-ADDRESS>:7860`** \
The Gradio frontend for this specific demo is served on port `7860`. \
For example, if your VM IP address is 172.27.193.230, go to `http://172.27.193.230:7860`
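
If you are unsure of the VM's IP address, a best-effort guess can be made from inside the notebook. The sketch below is an assumption-laden convenience, not part of the original demo: `guess_local_ip` and `demo_url` are hypothetical names, and on machines with several network interfaces the guessed address may not be the one reachable from your browser.

```python
import socket


def guess_local_ip():
    """Best-effort guess of this machine's outbound IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route.
        sock.connect(("192.0.2.1", 80))  # TEST-NET-1 documentation address
        return sock.getsockname()[0]
    except OSError:
        # No usable route; fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


def demo_url(ip, port=7860):
    """Build the URL for the Gradio frontend."""
    return f"http://{ip}:{port}"


print(demo_url(guess_local_ip()))
```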