# NVIDIA Inference Microservices (NIMs) Comparison
In this demo, we use the code cells in this notebook to interact with several kinds of NIM deployments and compare them against a traditional inferencing setup. \
**NOTE**: Most cells in this notebook depend on earlier ones, so please execute them in order.

## Install Dependencies
We will mainly be interacting with HuggingFace, NVIDIA, and LlamaIndex. \
Let's get started by installing the Python dependencies needed for this demo.


In [None]:
%pip install gradio
%pip install llama-index-llms-huggingface
%pip install llama-index-llms-huggingface-api
%pip install --upgrade --quiet llama-index-llms-nvidia llama-index-embeddings-nvidia llama-index-readers-file

In [None]:
!pip install "transformers[torch]" "huggingface_hub[inference]"

In [None]:
!pip install llama-index

In [None]:
import getpass
import os
os.environ["HF_TOKEN"] = ""
os.environ["NVIDIA_API_KEY"] = ""

## Prepare Environment
Let's set up the environment we need for interacting with the various APIs. \
We will need the following API keys: \
**HuggingFace Access Token** - to download the gated Llama-3.1-8B-Instruct model (apply for access [here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)) \
**NGC Key** - to download NIM containers to run On-Prem (entered below as `NVIDIA_API_KEY`, which starts with `nvapi-`)

In [1]:
import getpass
import os

# Reuse a token that is already in the environment so later cells receive it
# (previously hf_token stayed empty in that case)
hf_token = os.environ.get("HF_TOKEN", "")

# del os.environ['HF_TOKEN']  ## delete key and reset
if hf_token.startswith("hf_"):
    print("Valid HF_TOKEN already in environment. Delete to reset")
else:
    hf_token = getpass.getpass("HF TOKEN (starts with hf_): ")
    assert hf_token.startswith(
        "hf_"
    ), f"{hf_token[:5]}... is not a valid key"
    os.environ["HF_TOKEN"] = hf_token

HF TOKEN (starts with hf_):  ········


In [2]:
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith(
        "nvapi-"
    ), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

NVAPI Key (starts with nvapi-):  ········
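
The two key-entry cells above follow the same pattern: reuse a value from the environment if it has the expected prefix, otherwise prompt without echoing it. A minimal sketch of that pattern as one shared helper is shown below. The name `ensure_key` and its `prompt` parameter are hypothetical and not part of the original notebook; `prompt` exists only so the function can be exercised without a terminal.

```python
import getpass
import os


def ensure_key(env_name, prefix, prompt=getpass.getpass):
    """Return a key starting with `prefix`, reusing the environment value when
    it is already valid and prompting (without echo) otherwise."""
    current = os.environ.get(env_name, "")
    if current.startswith(prefix):
        return current
    key = prompt(f"{env_name} (starts with {prefix}): ")
    if not key.startswith(prefix):
        # Deliberately avoid echoing any part of the rejected secret.
        raise ValueError(f"{env_name} must start with {prefix!r}")
    os.environ[env_name] = key
    return key


# Hypothetical usage, equivalent to the two cells above:
# hf_token = ensure_key("HF_TOKEN", "hf_")
# nvapi_key = ensure_key("NVIDIA_API_KEY", "nvapi-")
```

Unlike the cells above, this sketch raises `ValueError` rather than using `assert`, so the check still runs when Python is started with optimizations enabled, and it does not print the first characters of a rejected key.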


## Inferencing without NIMs
In this section, the code cells download the model and serve it for inferencing directly through HuggingFace, without NIMs.

### Download Llama-3.1-8B-Instruct tokenizer & model from HuggingFace

#### Loading the Llama-3.1-8B-Instruct Tokenizer

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    token=hf_token,
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

#### Loading the Llama-3.1-8B-Instruct Model

In [4]:
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

local_llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=512,
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Llama-3.1-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)








Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
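
The choice between `torch.bfloat16` and the optional 4-bit configuration in the cell above mostly comes down to GPU memory. The back-of-the-envelope calculation below is a rough sketch: the parameter count of about 8.03 billion is an assumption taken from the model's public description, and the figures cover the weights only, not the KV cache, activations, or the quantization constants that a 4-bit scheme also stores.

```python
def weight_footprint_gib(n_params, bits_per_param):
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8.03e9  # assumed approximate parameter count of Llama-3.1-8B

print(f"bfloat16: {weight_footprint_gib(N_PARAMS, 16):.1f} GiB")
print(f"4-bit:    {weight_footprint_gib(N_PARAMS, 4):.1f} GiB")
```

Under these assumptions the bfloat16 weights need roughly 15 GiB, while a 4-bit load needs a little under 4 GiB before overheads, which is why the quantized path is offered for smaller GPUs.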

### Sample inferencing with locally-loaded HF model

#### Non-streaming Response

In [5]:
from llama_index.core.llms import ChatMessage

# response = local_llm.complete("What is Dell Technologies?")
# print(response)

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is Dell Technologies?"),
]
response = local_llm.chat(messages)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


assistant: Dell Technologies is an American multinational computer technology company that designs, develops, manufactures, markets, and services information technology (IT) products and services. The company was founded in 1984 by Michael Dell in his dorm room at the University of Texas at Austin.

Dell is one of the largest technology companies in the world and is known for its wide range of products and services, including:

1. **Personal Computers**: Dell offers a variety of desktops, laptops, and tablets for both personal and business use.
2. **Servers**: Dell provides a range of servers for data centers, including rack servers, blade servers, and storage systems.
3. **Storage**: Dell offers a variety of storage solutions, including hard disk drives, solid-state drives, and storage arrays.
4. **Networking**: Dell provides a range of networking products, including switches, routers, and network security solutions.
5. **Software**: Dell offers a range of software solutions, includin

#### Streaming Response

In [6]:
from llama_index.core.llms import ChatMessage

# content = ""
# for completion in local_llm.stream_complete("What is Dell Technologies?"):
#     content += completion.delta
#     print(completion.delta, end="")

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is Dell Technologies?"),
]

content = ""
for completion in local_llm.stream_chat(messages):
    content += completion.delta
    print(completion.delta, end="")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Dell Technologies is a multinational computer technology company that designs, develops, manufactures, markets, and services computers and other electronic devices. The company was founded in 1984 by Michael Dell in his dorm room at the University of Texas at Austin.

Dell is one of the largest technology companies in the world, and it has a diverse portfolio of products and services that cater to various industries, including:

1. **Personal Computers**: Dell offers a range of consumer and business PCs, including desktops, laptops, and tablets.
2. **Servers and Storage**: Dell provides a wide range of servers, storage systems, and networking solutions for businesses and data centers.
3. **Networking and Security**: Dell offers a variety of networking and security solutions, including switches, routers, and firewalls.
4. **Data Analytics and AI**: Dell provides data analytics and artificial intelligence solutions, including data storage, processing, and visualization tools.
5. **Cloud 

### Creating a Gradio UI for interacting with Local HuggingFace LLM
We will now create a Gradio frontend to consume the locally-loaded HF model. \
Launching Gradio clients consumes ports incrementally, starting from port `7860`. \
As this is the first of the demos we launch concurrently, it will be served on port `7860`.

In [7]:
import gradio as gr
from IPython.display import Markdown
from llama_index.core.llms import ChatMessage

css = """
.app-interface {
    height:80vh;
}
.chat-interface {
    height: 75vh;
}
.file-interface {
    height: 40vh;
}
"""

def stream_response_local(message, history):
    messages = [
        ChatMessage(role="system", content="You are a helpful assistant."),
        ChatMessage(role="user", content=message),
    ]
    response = local_llm.stream_chat(messages)
    res = ""
    for token in response:
        # print(token, end="")
        res = str(res) + str(token.delta)
        yield res

with gr.Blocks(css=css) as demo1:
    gr.Markdown(
    """
    <h1 style="text-align: center;">Local HuggingFaceLLM Chatbot 💻📑✨</h1>
    """)
    with gr.Row(equal_height=False, elem_classes=["app-interface"]):
        with gr.Column(scale=4, elem_classes=["chat-interface"]):
            test = gr.ChatInterface(fn=stream_response_local)
            

# Markdown(str('''
# \
# **NOTE: please dont use the above URL, use this instead: [{gradio_url}]({gradio_url})**
# '''.format(gradio_url='/proxy/7860')))

demo1.launch(server_name="0.0.0.0", ssl_verify=False, inline=False)




* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.




**NOTE: To access the On-Prem HuggingFace LLM Demo, please don't use the URL printed above; use this instead: `http://<YOUR-VM-IP-ADDRESS>:7860`**\
The Gradio frontend for this specific demo is served on port `7860`. \
For example, if my VM IP address is 172.27.193.230, go to `http://172.27.193.230:7860`
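
The incremental port assignment described in this section can be approximated with a small probe that tries to bind each port in turn. This is a sketch of the idea under stated assumptions, not Gradio's actual implementation; the function name `first_free_port` and the default search window of 100 ports are hypothetical.

```python
import socket


def first_free_port(start=7860, tries=100, host="0.0.0.0"):
    """Return the first port in [start, start + tries) that can be bound on `host`."""
    for port in range(start, min(start + tries, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    raise OSError(f"no free port found starting at {start}")


print("Next demo would likely be served on port", first_free_port())
```

Because the port depends on what is already running, the numbers quoted in this notebook only hold when the demos are launched in order on an otherwise idle machine. Passing `server_port=` to `launch()` is one way to pin a demo to a fixed port instead.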

## Inferencing with On-Prem NVIDIA Inference Microservices (NIM)
In this section, the code cells consume a locally-deployed Llama-3.1-8B-Instruct NIM container and use it for inferencing.
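
The cells in this section assume a NIM container is already listening on `localhost:8000`. A quick way to confirm that before involving LlamaIndex is to ask the endpoint which models it serves. The sketch below assumes the container exposes an OpenAI-style `GET /v1/models` route that returns a JSON object with a `data` list; the helper name `list_served_models` is hypothetical.

```python
import json
import urllib.error
import urllib.request

# Same base URL that this notebook passes to NVIDIA(base_url=...)
NIM_BASE_URL = "http://localhost:8000/v1"


def list_served_models(base_url=NIM_BASE_URL, timeout=5):
    """Return the model ids reported by GET {base_url}/models,
    or None if the endpoint is unreachable or returns non-JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    # Assumed response shape: {"data": [{"id": "..."}, ...]}
    return [entry.get("id") for entry in payload.get("data", [])]
```

Calling `list_served_models()` should return a list containing `meta/llama-3.1-8b-instruct` if the container is up and serving that model, and `None` if nothing is listening yet.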

In [8]:
# Some LlamaIndex components are async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

### Sample inferencing with On-Prem NIM model
In the following code cell, we first specify the NIM endpoint serving the Llama-3.1-8B-Instruct model and then execute a sample inference.

In [9]:
from llama_index.llms.nvidia import NVIDIA
from llama_index.core.llms import ChatMessage

# connect to a chat NIM running at localhost:8000, specifying a specific model
local_nim_llm = NVIDIA(
    base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct"
)

# content = ""
# for completion in local_nim_llm.stream_complete("What is Dell Technologies?"):
#     content += completion.delta
#     print(completion.delta, end="")

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is Dell Technologies?"),
]

content = ""
for completion in local_nim_llm.stream_chat(messages):
    content += completion.delta
    print(completion.delta, end="")

Dell Technologies is a multinational computer technology company that designs, develops, manufactures, markets, and supports computers and other electronic devices. The company was founded in 1984 by Michael Dell and is headquartered in Round Rock, Texas, USA.

Dell Technologies is a leading provider of a wide range of products and services, including:

1. **Personal computers**: Desktops, laptops, tablets, and mobile devices.
2. **Servers**: Data center servers, storage systems, and networking equipment.
3. **Storage**: Storage solutions, including hard disk drives, solid-state drives, and storage arrays.
4. **Networking**: Networking equipment, including switches, routers, and wireless access points.
5. **Virtualization**: Virtualization software and services, including VMware.
6. **Cloud computing**: Cloud infrastructure, platform, and software as a service (IaaS, PaaS, SaaS).
7. **Cybersecurity**: Cybersecurity solutions, including threat detection, incident response, and security 

### Creating a Gradio UI for interacting with On-Prem NIM
We will now create a Gradio frontend to consume the deployed On-Prem NIM model. \
Launching Gradio clients consumes ports incrementally, starting from port `7860`. \
As this is the second of the demos we launch concurrently, it will be served on port `7861`.

In [10]:
import gradio as gr
from IPython.display import Markdown
from llama_index.core.llms import ChatMessage

css = """
.app-interface {
    height:80vh;
}
.chat-interface {
    height: 75vh;
}
.file-interface {
    height: 40vh;
}
"""

def stream_response_local_nim(message, history):
    messages = [
        ChatMessage(role="system", content="You are a helpful assistant."),
        ChatMessage(role="user", content=message),
    ]
    response = local_nim_llm.stream_chat(messages)
    res = ""
    for token in response:
        # print(token, end="")
        res = str(res) + str(token.delta)
        yield res

with gr.Blocks(css=css) as demo3:
    gr.Markdown(
    """
    <h1 style="text-align: center;">Local NIM Chatbot 💻📑✨</h1>
    """)
    with gr.Row(equal_height=False, elem_classes=["app-interface"]):
        with gr.Column(scale=4, elem_classes=["chat-interface"]):
            test = gr.ChatInterface(fn=stream_response_local_nim)



Markdown(str('''
\
**NOTE: please don't use the above URL, use this instead: [{gradio_url}]({gradio_url})**
'''.format(gradio_url='/proxy/7861')))

demo3.launch(server_name="0.0.0.0", ssl_verify=False, inline=False)





* Running on local URL:  http://0.0.0.0:7861

To create a public link, set `share=True` in `launch()`.




**NOTE: To access the On-Prem NIMs Demo, please don't use the URL printed above; use this instead: `http://<YOUR-VM-IP-ADDRESS>:7861`**\
The Gradio frontend for this specific demo is served on port `7861`. \
For example, if my VM IP address is 172.27.193.230, go to `http://172.27.193.230:7861`
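
The `stream_response_*` functions in this notebook receive a `history` argument from `gr.ChatInterface` but send only the system prompt and the latest message, so each turn is answered without memory of earlier ones. If multi-turn context is wanted, the history can be flattened into role/content pairs first. The helper below is a hypothetical sketch (`history_to_turns` is not part of the notebook) and assumes history arrives either as `[user_text, assistant_text]` pairs or as `{"role": ..., "content": ...}` dictionaries with plain-text content, depending on the Gradio version and settings.

```python
def history_to_turns(message, history, system_prompt="You are a helpful assistant."):
    """Flatten Gradio chat history plus the new message into (role, content) pairs."""
    turns = [("system", system_prompt)]
    for item in history or []:
        if isinstance(item, dict):
            # "messages" style entry: {"role": ..., "content": ...}
            turns.append((item["role"], item["content"]))
        else:
            # "tuples" style entry: [user_text, assistant_text]
            user_text, assistant_text = item
            if user_text:
                turns.append(("user", user_text))
            if assistant_text:
                turns.append(("assistant", assistant_text))
    turns.append(("user", message))
    return turns


demo_history = [["What is a NIM?", "A packaged inference microservice."]]
for role, content in history_to_turns("Who provides it?", demo_history):
    print(f"{role}: {content}")
```

Inside a streaming function, the pairs could then be turned into LlamaIndex messages with `[ChatMessage(role=r, content=c) for r, c in history_to_turns(message, history)]`. Longer conversations increase the prompt length, so a real application would likely cap how many past turns are included.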

## Inferencing with Hosted NVIDIA Inference Microservices (NIM)
In this section, running the following code cells will consume a cloud-hosted Llama-3.1-8B-Instruct NIM.

### Sample inferencing with Hosted NIM model
In the following code cell, we will first specify the NIM endpoint serving Llama-3.1-8B-Instruct model, and execute a sample inference afterwards

In [11]:
from llama_index.llms.nvidia import NVIDIA
from llama_index.core.llms import ChatMessage

# connect to the NVIDIA-hosted chat NIM (default endpoint, authenticated via NVIDIA_API_KEY), specifying a specific model
hosted_nim_llm = NVIDIA(
    model="meta/llama-3.1-8b-instruct"
)

# content = ""
# for completion in hosted_nim_llm.stream_complete("What is Dell Technologies?"):
#     content += completion.delta
#     print(completion.delta, end="")

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is Dell Technologies?"),
]

content = ""
for completion in hosted_nim_llm.stream_chat(messages):
    content += completion.delta
    print(completion.delta, end="")

Dell Technologies is a multinational computer technology company that designs, manufactures, sells, and supports computers and other electronic devices. The company was founded in 1984 by Michael Dell and is headquartered in Round Rock, Texas.

Dell Technologies is a leading provider of a wide range of products and services, including:

1. **Personal computers**: Desktops, laptops, tablets, and 2-in-1 devices for consumers and businesses.
2. **Servers**: Data center servers, storage systems, and networking equipment for businesses and organizations.
3. **Storage**: External hard drives, solid-state drives, and storage arrays for data storage and management.
4. **Networking**: Switches, routers, and other networking equipment for businesses and organizations.
5. **Virtualization**: Software and services for virtualizing and managing IT infrastructure.
6. **Cloud computing**: Cloud-based services, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as

### Creating a Gradio UI for interacting with Hosted NIM
We will now create a Gradio frontend to consume the hosted NIM model. \
Launching Gradio clients consumes ports incrementally, starting from port `7860`. \
As this is the third of the demos we launch concurrently, it will be served on port `7862`.

In [12]:
import gradio as gr
from IPython.display import Markdown
from llama_index.core.llms import ChatMessage

css = """
.app-interface {
    height:80vh;
}
.chat-interface {
    height: 75vh;
}
.file-interface {
    height: 40vh;
}
"""

Markdown(str('''
\
**NOTE: please don't use the above URL, use this instead: [{gradio_url}]({gradio_url})**
'''.format(gradio_url='/proxy/7862')))

def stream_response_hosted_nim(message, history):
    messages = [
        ChatMessage(role="system", content="You are a helpful assistant."),
        ChatMessage(role="user", content=message),
    ]
    response = hosted_nim_llm.stream_chat(messages)
    res = ""
    for token in response:
        # print(token, end="")
        res = str(res) + str(token.delta)
        yield res

with gr.Blocks(css=css) as demo2:
    gr.Markdown(
    """
    <h1 style="text-align: center;">Hosted NIM Chatbot 💻📑✨</h1>
    """)
    with gr.Row(equal_height=False, elem_classes=["app-interface"]):
        with gr.Column(scale=4, elem_classes=["chat-interface"]):
            test = gr.ChatInterface(fn=stream_response_hosted_nim)
            

demo2.launch(server_name="0.0.0.0", ssl_verify=False, inline=False)



* Running on local URL:  http://0.0.0.0:7862

To create a public link, set `share=True` in `launch()`.




Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


**NOTE: To access the Hosted NIMs Demo, please don't use the URL printed above; use this instead: `http://<YOUR-VM-IP-ADDRESS>:7862`**\
The Gradio frontend for this specific demo is served on port `7862`. \
For example, if my VM IP address is 172.27.193.230, go to `http://172.27.193.230:7862`
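
With all three backends running, the comparison this notebook aims for can be made more concrete by timing the same prompt against each one. The helper below is a sketch and not a rigorous benchmark: `time_stream` is a hypothetical name, it only assumes that each streamed chunk exposes a `.delta` attribute as in the cells above, and it counts chunks, which are not necessarily tokens. A fake stream stands in for a real model so the block runs on its own.

```python
import time
from types import SimpleNamespace


def time_stream(start_stream, clock=time.perf_counter):
    """Start a stream via `start_stream()` and measure time to the first chunk
    and total time while concatenating each chunk's `.delta`."""
    start = clock()
    first = None
    pieces = []
    for chunk in start_stream():
        if first is None:
            first = clock()
        pieces.append(chunk.delta or "")
    end = clock()
    return {
        "text": "".join(pieces),
        "chunks": len(pieces),
        "time_to_first_chunk_s": None if first is None else first - start,
        "total_s": end - start,
    }


def fake_stream():
    # Stand-in for e.g. local_llm.stream_chat(messages)
    for piece in ["Dell ", "Technologies ", "is..."]:
        time.sleep(0.01)
        yield SimpleNamespace(delta=piece)


stats = time_stream(fake_stream)
print(stats["chunks"], "chunks,", f"{stats['total_s']:.3f}s total")
```

In the notebook, the same function could be called as `time_stream(lambda: local_llm.stream_chat(messages))` and likewise for `local_nim_llm` and `hosted_nim_llm`. Results for the hosted endpoint include network latency, so several runs per backend give a fairer picture than a single measurement.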