# NVIDIA NIMs

The [llama-index-llms-nvidia](https://pypi.org/project/llama-index-llms-nvidia/) package contains LlamaIndex integrations for chat models powered by [NVIDIA AI Foundation Models](https://www.nvidia.com/en-us/ai-data-science/foundation-models/) and hosted on the [NVIDIA API Catalog](https://build.nvidia.com/).

NVIDIA AI Foundation models are community- and NVIDIA-built models that NVIDIA has optimized to deliver the best performance on NVIDIA accelerated infrastructure. Using the API, you can query live endpoints on the NVIDIA API Catalog to get quick results from a DGX-hosted cloud compute environment. All models are source-accessible and can be deployed on your own compute cluster using NVIDIA NIM, which is part of NVIDIA AI Enterprise.

Models can be exported from NVIDIA’s API catalog with NVIDIA NIM, which is included with the NVIDIA AI Enterprise license, and run on-premises, giving enterprises ownership of their customizations and full control of their IP and AI applications. NIMs are packaged as container images on a per-model or per-model-family basis and are distributed through the NVIDIA NGC Catalog. At their core, NIMs are containers that provide interactive APIs for running inference on an AI model.

# NVIDIA's LLM connector

This example goes over how to use LlamaIndex to interact with and develop LLM-powered systems using the publicly-accessible AI Foundation endpoints.

With this connector, you'll be able to connect to and generate from compatible models available as hosted [NVIDIA NIMs](https://ai.nvidia.com), such as:

- Google's [gemma-7b](https://build.nvidia.com/google/gemma-7b)
- Mistral AI's [mistral-7b-instruct-v0.2](https://build.nvidia.com/mistralai/mistral-7b-instruct-v2)
- And more!

## Installation

In [None]:
%pip install --upgrade --quiet llama-index-llms-nvidia llama-index-embeddings-nvidia llama-index-readers-file

## Setup

**To get started:**

1. Create a free account with [NVIDIA](https://build.nvidia.com/), which hosts NVIDIA AI Foundation models.

2. Click on your model of choice.

3. Under Input select the Python tab, and click `Get API Key`. Then click `Generate Key`.

4. Copy and save the generated key as NVIDIA_API_KEY. From there, you should have access to the endpoints.

In [59]:
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Valid NVIDIA_API_KEY already in environment. Delete to reset
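
For scripts that run non-interactively, where `getpass` cannot prompt, the same check can be factored into a small helper. This is only a minimal sketch: the name `require_nvidia_api_key` is hypothetical and not part of the connector, and the `nvapi-` prefix test simply mirrors the cell above.

```python
import os
from typing import Mapping, Optional


def require_nvidia_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return NVIDIA_API_KEY from `env`, or raise if it is missing or malformed.

    Hypothetical helper (not part of llama-index-llms-nvidia).
    """
    env = os.environ if env is None else env
    key = env.get("NVIDIA_API_KEY", "")
    if not key.startswith("nvapi-"):
        # Deliberately do not echo any part of the key in the error message.
        raise RuntimeError(
            "NVIDIA_API_KEY is missing or does not start with 'nvapi-'"
        )
    return key
```

Calling `require_nvidia_api_key()` with no argument reads from `os.environ`; passing a plain dict makes the check easy to exercise in tests.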


In [60]:
# nest_asyncio lets async code run inside a notebook, which already has a running event loop
import nest_asyncio

nest_asyncio.apply()

## Working with NVIDIA API Catalog

In [61]:
from llama_index.llms.nvidia import NVIDIA
from llama_index.core.llms import ChatMessage, MessageRole

llm = NVIDIA()

messages = [
    ChatMessage(
        role=MessageRole.SYSTEM, content=("You are a helpful assistant.")
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=("What are the most popular house pets in North America?"),
    ),
]

llm.chat(messages)

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content="According to various surveys and data, the most popular house pets in North America are:\n\n1. Dogs: With over 78 million households owning a dog, they are the most popular pet in North America. Breeds like Labrador Retrievers, German Shepherds, and Golden Retrievers are among the most popular.\n2. Cats: With over 65 million households owning a cat, they are the second most popular pet. Breeds like Siamese, Persian, and Maine Coon are popular among cat owners.\n3. Fish: With over 12 million households owning fish, they are a popular choice for those who want a low-maintenance pet.\n4. Birds: With over 7 million households owning birds, they are a popular choice for those who enjoy their songs and colorful plumage.\n5. Small mammals: Guinea pigs, hamsters, and rabbits are popular pets among children and adults alike.\n\nIt's worth noting that these numbers can vary depending on the region, city, and eve

## Working with NVIDIA NIMs

In addition to connecting to hosted [NVIDIA NIMs](https://ai.nvidia.com), this connector can be used to connect to local microservice instances. This helps you take your applications local when necessary.

For instructions on how to set up local microservice instances, see https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/

In [None]:
from llama_index.llms.nvidia import NVIDIA

# connect to a chat NIM running at localhost:2016
llm = NVIDIA(base_url="http://localhost:2016/v1")
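
Before pointing the connector at a local NIM, it can help to check which model names the instance serves. The sketch below assumes the NIM exposes an OpenAI-compatible `GET /v1/models` endpoint returning `{"data": [{"id": ...}, ...]}`; that endpoint shape and the helper names (`models_url`, `parse_model_ids`, `list_local_models`) are assumptions for illustration, not part of the connector.

```python
import json
from urllib.request import urlopen


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base URL such as http://localhost:2016/v1."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}, ...]} payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url: str, timeout: float = 5.0) -> list:
    """Query the (assumed) /models endpoint of a running NIM and return its model ids."""
    with urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With a NIM running, `list_local_models("http://localhost:2016/v1")` would return the ids that can be passed as `model=` alongside `base_url=`.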

## Loading a specific model
Now we can load our `NVIDIA` LLM by passing in the model name, as listed in the [API reference](https://docs.api.nvidia.com/nim/reference/).

> NOTE: The default model is `meta/llama3-8b-instruct`.

In [62]:
llm = NVIDIA(model="mistralai/mistral-7b-instruct-v0.2")


We can check which model our `llm` object is currently associated with via the `.model` attribute.

In [63]:
llm.model

'mistralai/mistral-7b-instruct-v0.2'

## Basic Functionality

Now we can explore the different ways you can use the connector within the LlamaIndex ecosystem!

Several of the methods below expect a list of `ChatMessage` objects as input; we'll reuse the `messages` list defined earlier.

We'll follow the same basic pattern for each example: 

1. We'll point our `NVIDIA` LLM to our desired model
2. We'll examine how to use the endpoint to achieve the desired task!

### Complete: `.complete()`

We can use `.complete()`/`.acomplete()` (which takes a string) to prompt a response from the selected model.

Let's use our default model for this task.

In [64]:
completion_llm = NVIDIA()

We can verify this is the expected default by checking the `.model` attribute.

In [65]:
completion_llm.model

'meta/llama3-8b-instruct'

Let's call `.complete()` on our model with a string, in this case `"Hello!"`, and observe the response.

In [66]:
completion_llm.complete("Hello!")

CompletionResponse(text="Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", additional_kwargs={}, raw={'id': 'chatcmpl-932b7cc0-42ca-449e-8313-ee7ea65e22d8', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None), ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None)]), message=ChatCompletionMessage(content="Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", role='assistant', function_call=None, tool_calls=None))], 'created': 1716592683, 'model': 'meta/llama3-8b-instruct', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=25, prompt_tokens=12, total_tokens=37)}, logprobs=None, delta=None)

As expected in LlamaIndex, we get a `CompletionResponse` back.
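
The `raw` field carries the provider payload. In the output above it is a dict whose `'usage'` entry exposes `prompt_tokens`, `completion_tokens`, and `total_tokens`. A minimal sketch of reading those counts follows; `token_usage` is a hypothetical helper, the stand-in object only mirrors the printed output so the snippet runs without an API key, and the layout of `raw` may differ between versions.

```python
from types import SimpleNamespace


def token_usage(response) -> dict:
    """Return token counts from a response whose `raw` looks like the output above."""
    usage = response.raw["usage"]
    return {
        "prompt": usage.prompt_tokens,
        "completion": usage.completion_tokens,
        "total": usage.total_tokens,
    }


# Stand-in mirroring the printed CompletionResponse (not a real API call).
example_response = SimpleNamespace(
    text="Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?",
    raw={
        "usage": SimpleNamespace(
            completion_tokens=25, prompt_tokens=12, total_tokens=37
        )
    },
)

print(token_usage(example_response))
```

In the notebook, the same helper could be applied to the result of `completion_llm.complete("Hello!")`.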

#### Async Complete: `.acomplete()`

There is also an async implementation which can be leveraged in the same way!

In [67]:
await completion_llm.acomplete("Hello!")

CompletionResponse(text="Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", additional_kwargs={}, raw={'id': 'chatcmpl-6bd9e622-a725-4ac8-a2dc-6949a59af0d4', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None), ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None)]), message=ChatCompletionMessage(content="Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?", role='assistant', function_call=None, tool_calls=None))], 'created': 1716592685, 'model': 'meta/llama3-8b-instruct', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=25, prompt_tokens=12, total_tokens=37)}, logprobs=None, delta=None)
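
The main benefit of the async methods is concurrency: several prompts can be in flight at once. The sketch below uses `asyncio.gather` for that. `complete_many` and `EchoLLM` are hypothetical names; `EchoLLM` stands in for the `NVIDIA` LLM so the snippet runs without an API key, and any object with an `acomplete` coroutine returning something with a `.text` attribute would work in its place.

```python
import asyncio
from types import SimpleNamespace


class EchoLLM:
    """Hypothetical stand-in exposing the same `acomplete` coroutine shape."""

    async def acomplete(self, prompt: str):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return SimpleNamespace(text=f"echo: {prompt}")


async def complete_many(llm, prompts):
    """Run all prompts concurrently; results come back in prompt order."""
    responses = await asyncio.gather(*(llm.acomplete(p) for p in prompts))
    return [r.text for r in responses]
```

In a notebook cell this could be called as `await complete_many(completion_llm, ["Hello!", "Goodbye!"])`.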

### Chat: `.chat()`

Now we can try the same thing using the `.chat()` method. This method expects a list of chat messages - so we'll use the one we created above.

We'll use the `mistralai/mixtral-8x7b-instruct-v0.1` model for the example.

In [68]:
chat_llm = NVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

All we need to do now is call `.chat()` on our list of `ChatMessage` objects and observe the response.

Notice that we can also pass additional keyword arguments that influence generation. Here we use the `seed` parameter to influence the generation and the `stop` parameter to tell the model to stop generating once it reaches a certain token.

> NOTE: You can find information about what additional kwargs are supported by the model's endpoint by referencing the API documentation for the selected model. Mixtral's is located [here](https://docs.api.nvidia.com/nim/reference/mistralai-mixtral-8x7b-instruct-infer) as an example!

In [69]:
chat_llm.chat(messages, seed=4, stop=["cat", "cats", "Cat", "Cats"])

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=" In North America, the most popular types of house pets are:\n\n1. Dogs: Man's best friend is the most popular pet in North America. They are known for their loyalty, companionship, and the variety of breeds that can fit different lifestyles.\n\n2. Cats", additional_kwargs={}), raw={'id': 'chatcmpl-6d736bb5-12b3-4385-af97-8e48ff7d9353', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None), ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None), ChatCompletionTokenLogprob(token=None, bytes=None, logprob=0.0, top_logprobs=None)]), message=ChatCompletionMessage(content=" In North America, the most popular types of house pets are:\n\n1. Dogs: Man's best friend is the most popular pet in North America. They are known for their loyalty, companionship, and the variety of b

As expected, we receive a `ChatResponse` in response.

#### Async Chat: `.achat()`

We also have an async implementation of the `.chat()` method which can be called in the following way.

In [70]:
await chat_llm.achat(messages)

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=" In North America, the most popular types of house pets are:\n\n1. Dogs: Man's best friend is the most popular pet in North America. They are known for their loyalty, companionship, and the variety of breeds that can fit different lifestyles.\n\n2. Cats: These are the second most popular pets due to their independence, low maintenance, and longevity. They also come in various breeds, each with its unique personality and appearance.\n\n3. Fish: Freshwater and saltwater fish are popular choices for pet owners who prefer a quieter, less interactive pet. They can be soothing to watch and require less daily care than dogs or cats.\n\n4. Birds: Birds like canaries, parakeets, and parrots can make wonderful pets. They are intelligent, social, and some species can even learn to mimic human speech.\n\n5. Small mammals: This group includes pets such as hamsters, guinea pigs, rabbits, and reptiles like turtles an

### Stream: `.stream_chat()`

We can also use the models found on `build.nvidia.com` for streaming use-cases!

Let's select another model and observe this behaviour. We'll use Google's `gemma-7b` model for this task.

In [71]:
stream_llm = NVIDIA(model="google/gemma-7b")

Let's call our model with `.stream_chat()`, which again expects a list of `ChatMessage` objects, and capture the response.

In [72]:
streamed_response = stream_llm.stream_chat(messages)

In [73]:
streamed_response

<generator object llm_chat_callback.<locals>.wrap.<locals>.wrapped_llm_chat.<locals>.wrapped_gen at 0x2b17e4f20>

As we can see, the response is a generator with the streamed response. 

Let's take a look at the final response once the generation is complete.

In [74]:
last_element = None
for last_element in streamed_response:
    pass

print(last_element)

assistant: **Top Popular House Pets in North America:**

**1. Dogs:**
* Estimated 63.4 million pet dogs in households (2023)
* Known for their loyalty, companionship, and trainability

**2. Cats:**
* Estimated 38.4 million pet cats in households (2023)
* Known for their independence, affection, and low-maintenance nature

**3. Fish:**
* Estimated 14.5 million pet fish in households (2023)
* Popular for their tranquility, beauty, and variety of species

**4. Small mammals (guinea pigs, hamsters, rabbits):**
* Estimated 14.4 million pet small mammals in households (2023)
* Known for their playful and affectionate nature

**5. Birds:**
* Estimated 13.3 million pet birds in households (2023)
* Known for their beauty, song, and intelligence

**Other popular pets:**

* Tortoises and reptiles
* Hamsters and rodents
* Invertebrates (such as spiders and hermit crabs)

**Factors influencing pet popularity:**

* **Lifestyle and living situation:** Urban dwellers are more likely to have cats, whil
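
The loop above discards every intermediate chunk and prints only the last one. To display text as it arrives, each streamed chunk's `.delta` (the newly generated piece) can be printed instead. A minimal sketch: `FakeChunk` and `fake_stream` are hypothetical stand-ins so the snippet runs without an API key, and `print_stream` is an illustrative helper rather than part of the connector.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chat chunk; only `.delta` is modeled."""

    delta: str


def fake_stream() -> Iterator[FakeChunk]:
    """Yield a few chunks the way a streaming endpoint would."""
    for piece in ["Dogs", ", ", "cats", "."]:
        yield FakeChunk(delta=piece)


def print_stream(chunks: Iterable) -> str:
    """Print each delta as it arrives and return the assembled text."""
    parts = []
    for chunk in chunks:
        piece = chunk.delta or ""  # tolerate a None or empty delta
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)
```

With a live model, the same helper could be fed directly: `print_stream(stream_llm.stream_chat(messages))`.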

#### Async Stream: `.astream_chat()`

We have the equivalent async method for streaming as well, which can be used in a similar way to the sync implementation.

In [75]:
streamed_response = await stream_llm.astream_chat(messages)

In [76]:
streamed_response

<async_generator object llm_chat_callback.<locals>.wrap.<locals>.wrapped_async_llm_chat.<locals>.wrapped_gen at 0x2b1708dc0>

In [77]:
last_element = None
async for last_element in streamed_response:
    pass

print(last_element)

assistant: **Top Popular House Pets in North America:**

**1. Dogs:**
* Estimated 63.4 million pet dogs in households (2023)
* Known for their loyalty, companionship, and trainability

**2. Cats:**
* Estimated 38.4 million pet cats in households (2023)
* Known for their independence, affection, and low-maintenance nature

**3. Fish:**
* Estimated 14.5 million pet fish in households (2023)
* Popular for their tranquility, beauty, and variety of species

**4. Small mammals (guinea pigs, hamsters, rabbits):**
* Estimated 14.4 million pet small mammals in households (2023)
* Known for their playful and affectionate nature

**5. Birds:**
* Estimated 13.3 million pet birds in households (2023)
* Known for their beauty, song, and intelligence

**Other popular pets:**

* Tortoises and reptiles
* Hamsters and rodents
* Invertebrates (such as spiders and hermit crabs)

**Factors influencing pet popularity:**

* **Lifestyle and living situation:** Urban dwellers are more likely to have cats, whil

## Streaming Query Engine Responses

Let's look at a slightly more involved example using a query engine!

We'll start by loading some data (we'll be using the [Hitchhiker's Guide to the Galaxy](https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt)).

### Loading Data

Let's first create a directory where our data can live.

In [None]:
!mkdir -p 'data/hhgttg'

We'll download our data from the above source.

In [None]:
!wget 'https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt' -O 'data/hhgttg/hhgttg.txt'

We'll need an embedding model for this step. We'll use NVIDIA's `NV-Embed-qa` model and save it in our `Settings`.

In [82]:
from llama_index.embeddings.nvidia import NVIDIAEmbedding
from llama_index.core import Settings

embedder = NVIDIAEmbedding(model="NV-Embed-qa", truncate="END")
Settings.embed_model = embedder

Now we can load our document and create an index using the `NVIDIAEmbedding` model configured above.

In [81]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data/hhgttg").load_data()
index = VectorStoreIndex.from_documents(documents)

Now we can create a simple query engine and set our `streaming` parameter to `True`.

In [None]:
streaming_qe = index.as_query_engine(streaming=True)

Let's send a query to our query engine, and then stream the response.

In [None]:
streaming_response = streaming_qe.query(
    "What is the significance of the number 42?",
)

In [None]:
streaming_response.print_response_stream()