![NVIDIA Logo](images/nvidia.png)

# Working With NeMo Service Models

In this notebook you'll establish your first connection to NeMo Service, an NVIDIA cloud-hosted GPU service that we will use throughout this workshop, and try out text generation with some of the models NeMo Service has to offer.

NeMo Service has many incredible features, and we won't cover them all today. In the context of LLM customization, it provides a variety of models that we can interface with through its API, as well as an easy way to perform several parameter-efficient fine-tuning techniques, which we will rely on heavily today.

![NeMo Service](images/nemo_service.png)

---

## Learning Objectives

By the time you complete this notebook you will:
- Know how to establish a connection to NeMo Service through its `nemollm.api` Python package.
- Have seen the variety of models available to us through the API.
- Have generated your first large language model responses for the course.

---

## Imports

We will begin every notebook by performing the imports necessary for the current notebook.

Take note of `from nemollm.api import NemoLLM`, which imports the `NemoLLM` class that we will use throughout this course to communicate with NeMo Service.

In [None]:
import os

from nemollm.api import NemoLLM
from llm_utils.nemo_service_models import NemoServiceBaseModel

---

## Connect to LLM Service

Here is the boilerplate for establishing a connection to NeMo Service. For the workshop today we have provided an API key for your use.

In [None]:
api_key = os.getenv('NGC_API_KEY')
api_host = os.getenv('API_HOST')

In [None]:
conn = NemoLLM(
    api_host=api_host,
    api_key=api_key
)
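
`os.getenv` returns `None` when a variable is unset, so a missing key or host would only surface later as a failed request. A minimal sketch of a guard you could run before connecting is shown below. `require_env` is a hypothetical helper written for this illustration, not part of `nemollm` or `llm_utils`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Treat both "unset" and "empty string" as missing.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage, mirroring the cells above:
# api_key = require_env('NGC_API_KEY')
# api_host = require_env('API_HOST')
```

Failing early like this keeps configuration problems separate from problems with the service itself.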

---

## List Models

NeMo Service hosts quite a few models out of the box and will also host the model customizations we create later in the workshop. This is our first use of the `conn` object to make an API call to the service. In this case we are requesting the list of base models available to us.

In [None]:
response = conn.list_models()
models = {}

for model in response['models']:
    name = model.get('name')
    features = model.get('features')
    models[name] = features

In [None]:
models

As you can see, we have access to a variety of NeMo GPT models as well as community models like LLaMA-2-70B.
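
The loop above can also be written as a dict comprehension, and the resulting mapping is easy to filter. The sketch below runs offline against a made-up payload that only mimics the shape the loop relies on (a `'models'` list whose items have `'name'` and `'features'`). The model and feature names in it are placeholders, not real NeMo Service values.

```python
# Made-up payload with the same shape as the response used above.
sample_response = {
    'models': [
        {'name': 'example-model-a', 'features': ['feature_x', 'feature_y']},
        {'name': 'example-model-b', 'features': ['feature_x']},
    ]
}

# Same name -> features mapping as the loop, written as a dict comprehension.
models_by_name = {m.get('name'): m.get('features') for m in sample_response['models']}

# Names of the models that list a given (placeholder) feature.
with_feature_y = [
    name for name, feats in models_by_name.items() if 'feature_y' in (feats or [])
]

print(models_by_name)
print(with_feature_y)
```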

---

## Model List Helper

This course contains a package `llm_utils`. In order to reduce boilerplate, and also as a reference to you for later work, `llm_utils` contains quite a lot of code that will be of service to you today. Throughout the course as you are introduced to code imported from `llm_utils`, you are encouraged to check out the imported modules in the `llm_utils` directory to learn more about how we approached working efficiently in the context of model customization.

Our first imports are of enums we've created that we will use to make appropriate models available to us in specific notebooks. Each enum has a `list_models` method we will use to observe the models available to us. Here we list them all. You will see some overlap since some models are appropriate for use in multiple customization contexts.

The `value` property (on the right-hand side, for example `gpt-43b-001`) is the actual string name that NeMo Service expects when we want to interact with a model.

In [None]:
from llm_utils.models import Models, PtuneableModels, LoraModels

In [None]:
Models.list_models()

In [None]:
PtuneableModels.list_models()

In [None]:
LoraModels.list_models()
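
To make the enum idea concrete, here is a rough sketch of how an enum with a `list_models` method could be put together. It is not the implementation in `llm_utils` (read the module for that). `ExampleModels`, its member names, and the return format of `list_models` are all assumptions made for illustration.

```python
from enum import Enum


class ExampleModels(Enum):
    """Hypothetical enum; member names here are illustrative only."""

    gpt_43b = 'gpt-43b-001'          # value string taken from the text above
    other_model = 'example-model-x'  # placeholder value, not a real model name

    @classmethod
    def list_models(cls):
        """Return (member name, service string) pairs for every member."""
        return [(member.name, member.value) for member in cls]


print(ExampleModels.list_models())
print(ExampleModels.gpt_43b.value)
```

Keeping the service strings in one place means a typo in a model name fails at attribute lookup rather than as a rejected request.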

---

## Generating Model Responses

`conn.generate` is the method for sending prompts to NeMo Service LLMs and generating a response. As you can see from its docstring, it takes many arguments, most of them optional, that affect how the model generates a response. We will introduce arguments to `generate` in the context of their use when appropriate in the workshop.

In [None]:
help(conn.generate)

---

## Your First LLM Generation

Here is the most basic possible way to generate a model response: pass `generate` a `model` name and a prompt. `conn.generate` will return a dict with details about the model's response.

In [None]:
response = conn.generate(
    model='gpt-43b-001',
    prompt='Tell me about parameter efficient fine-tuning.'
)

In [None]:
print(response)

---

## Changing Model Response With Additional Parameters

As you saw above, the response from `conn.generate` is by default a dict. In this course we will be almost entirely focused on the quality of the text generated by the model and will prefer simple string outputs, which we can get by setting `return_type='text'` as we do immediately below.

In the following cell we also set `tokens_to_generate=100`, which limits the model's output to at most 100 tokens.

In [None]:
response = conn.generate(
    model='gpt-43b-001',
    prompt='Tell me about parameter efficient fine-tuning.',
    tokens_to_generate=100,
    return_type='text'
)

In [None]:
print(response)

---

## NeMo Service Model Utils

Rather than work directly with `conn.generate` we are going to primarily interact with NeMo Service models through a helper class `NemoServiceBaseModel`. We've built several conveniences into this class that we will utilize at appropriate times throughout the course.

In [None]:
from llm_utils.nemo_service_models import NemoServiceBaseModel

To use the class we create an instance with the NeMo Service model we would like to use. Here we select the LLaMA-2 70B chat variant.

In [None]:
llm = NemoServiceBaseModel(Models.llama70b_chat.value)

`llm.generate` now behaves almost exactly like `conn.generate` except we don't need to pass the model name in every time we call it.

In [None]:
response = llm.generate('What is prompt engineering?')

In [None]:
print(response)
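
The core idea of a helper like this, binding a model name once so it need not be repeated, can be sketched in a few lines. This is not the code of `NemoServiceBaseModel`. `BoundModel` and `FakeConn` are hypothetical names, the fake connection exists only so the sketch runs offline, and defaulting `return_type` to `'text'` is an assumption.

```python
class BoundModel:
    """Hypothetical wrapper that remembers a model name for a connection-like object."""

    def __init__(self, conn, model):
        self.conn = conn
        self.model = model

    def generate(self, prompt, **kwargs):
        # Default to plain text, but let the caller override any keyword argument.
        kwargs.setdefault('return_type', 'text')
        return self.conn.generate(model=self.model, prompt=prompt, **kwargs)


class FakeConn:
    """Stand-in for a real connection so the sketch runs without the service."""

    def generate(self, model, prompt, **kwargs):
        # Echo the arguments back instead of calling a model.
        return f"[{model}] echo: {prompt} {sorted(kwargs.items())}"


llm_sketch = BoundModel(FakeConn(), 'gpt-43b-001')
print(llm_sketch.generate('What is prompt engineering?', tokens_to_generate=20))
```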

During iterative prompt engineering it's nice to not have to wait for the entire response to be generated before viewing it. To accomplish this you can set `return_type='stream'`.

In [None]:
llm.generate('Tell me about large language models.', return_type='stream')

Just as a reminder, all the keyword arguments available to `conn.generate` can also be passed into `llm.generate`. Next we will touch on several that you will likely want to use during the workshop, along with a couple of other techniques, like white space stripping, that are very common when working with LLMs.

---

## Tokens to Generate

The `tokens_to_generate` named argument can control the maximum length of the model's response. Here we show a few examples of how changing its value can result in model responses of different lengths.

In [None]:
llm.generate('Tell me about large language models.', tokens_to_generate=300, return_type='stream')

---

Here we drastically reduce the number of tokens to generate. Note that doing so doesn't mean that the model will "complete its thoughts" by the specified length, only that the generation will stop after this number of tokens.

In [None]:
llm.generate('Tell me about large language models.', tokens_to_generate=30, return_type='stream')
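
The cut-off behavior is easy to picture with a toy example. The sketch below treats whitespace-separated words as "tokens", which is a simplification (real models use subword tokenizers, so counts will differ), and `truncate_tokens` is a hypothetical helper, not a NeMo Service function.

```python
def truncate_tokens(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated pieces of text."""
    return ' '.join(text.split()[:max_tokens])


full = 'Large language models are neural networks trained on large amounts of text.'

# The output simply stops at the limit, mid-sentence if need be.
print(truncate_tokens(full, 5))
```

A hard limit like this ends the text wherever the count runs out, which is why short limits often produce unfinished sentences.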

---

`tokens_to_generate` is especially helpful when we observe that a model is providing more of a response than we would like and we are only interested in the first part of its response.

Here we give a toy example where we only want the model to give us a 'yes' or 'no' answer.

In [None]:
llm.generate('Is the Earth round?', tokens_to_generate=20, return_type='stream')

---

With `tokens_to_generate` we could capture only the part of the response we are interested in.

In [None]:
llm.generate('Is the Earth round?', tokens_to_generate=5, return_type='stream')

---

## White Space Stripping

You will almost always want to strip white space off your model's responses, which we can do with Python's `strip` method. Here is a simple LLM prompt, where you might notice some leading white space.

In [None]:
llm.generate('What is the capital of California? Answer: ', tokens_to_generate=10)

Here we make the same call but use the Python string method `strip` to strip the extra, unwanted white space.

In [None]:
llm.generate('What is the capital of California? Answer: ', tokens_to_generate=10).strip()

It's worth pointing out that when we set `return_type='stream'` we are unable to call Python string methods on the response, since the streaming functionality does not return a single string.
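
If a streamed response is available to your code as an iterable of text chunks, one way around this is to join the chunks first and strip afterwards. Whether `llm.generate` exposes the stream that way is an assumption here, so check `llm_utils` before relying on it. `fake_stream` below is a stand-in generator, not a NeMo Service call.

```python
def fake_stream():
    """Stand-in for a streamed response: yields text in small chunks."""
    yield '  Sacra'
    yield 'mento'
    yield '\n'


# Collect the chunks first, then treat the result as an ordinary string.
text = ''.join(fake_stream()).strip()
print(text)
```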

---

## Early Stopping

Another very common technique with LLM responses is to stop generation when a specific token appears, typically a newline character `'\n'` or some sentence-ending punctuation like a period `'.'`.

Here is our basic text generation prompt from earlier.

In [None]:
print(llm.generate('What is the capital of California? Answer: ', tokens_to_generate=20).strip())

---

Here we provide the `stop` named argument to `generate` to indicate that the model should stop generating once it produces a newline.

In [None]:
print(llm.generate('What is the capital of California? Answer: ', stop=['\n'], tokens_to_generate=20).strip())

---

In this case we might accomplish something similar by stopping at periods.

In [None]:
print(llm.generate('What is the capital of California? Answer: ', stop=['.'], tokens_to_generate=20).strip())

---

It's worth mentioning that `stop` expects a list of strings, so if you want, you can provide more than one stop string.
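
To see what several stop strings do together, here is a client-side imitation that cuts already-generated text at whichever stop string occurs first. It is only an illustration of the idea. The service applies `stop` during generation, and whether it keeps the stop string in its output is not covered here. `cut_at_stop` is a hypothetical helper that drops it.

```python
def cut_at_stop(text: str, stop: list) -> str:
    """Return text up to (not including) the earliest occurrence of any stop string."""
    positions = [text.find(s) for s in stop if s and s in text]
    return text[:min(positions)] if positions else text


raw = ' Sacramento. It is located in the north\nof the state'

# The period comes before the newline, so the text is cut at the period.
print(cut_at_stop(raw, ['\n', '.']).strip())
```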

---

## Controlling Model Randomness

The named arguments `top_k`, `temperature` and `top_p` can influence the randomness of a model's responses.

Detailed coverage of these arguments is outside the scope of this workshop, but know that by default the model's response to a given prompt will be identical every time, as the next two cells show.

In [None]:
llm.generate('Write a haiku. Haiku: ', return_type='stream')

In [None]:
llm.generate('Write a haiku. Haiku: ', return_type='stream')

---

By setting `top_k`, an integer, to a value greater than 1, we can indicate that the model should consider more than just the single most likely candidate for the next token.

By setting `temperature`, a floating point value between 0 and 1, closer to 1, we can indicate that the model should even out the probabilities of the possible next tokens.

In this workshop, during sections on synthetic data generation, there will be times when you will likely want to increase `top_k` and `temperature` to create diverse outputs given the same prompt, as we do here.

In [None]:
llm.generate('Write a haiku. Haiku: ', top_k=3, temperature=.5, return_type='stream')

In [None]:
llm.generate('Write a haiku. Haiku: ', top_k=3, temperature=.5, return_type='stream')
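
For intuition about how these two arguments interact, the sketch below implements generic top-k sampling with a temperature-scaled softmax over a handful of made-up token scores. It illustrates the general technique only and is not how NeMo Service is implemented. `sample_next` and the scores are invented for this example, and the sketch requires `temperature` to be greater than 0.

```python
import math
import random


def sample_next(logits: dict, top_k: int = 1, temperature: float = 1.0, rng=random) -> str:
    """Sample one token from the top_k highest-scoring candidates."""
    # Keep only the top_k candidates by score.
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax weights: lower temperature sharpens the
    # distribution, higher temperature flattens it.
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [tok for tok, _ in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for four candidate next tokens.
scores = {'autumn': 2.0, 'winter': 1.0, 'spring': 0.5, 'summer': 0.1}

print(sample_next(scores, top_k=1))                   # only one candidate to choose from
print(sample_next(scores, top_k=3, temperature=0.5))  # may vary from run to run
```

With `top_k=1` there is only one candidate, so the result never changes. Once `top_k` is larger, `temperature` decides how often the less likely candidates are picked.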

---

## Warm Up Exercise

You're going to be working with instances of `NemoServiceBaseModel` throughout the workshop and one of the main goals for this notebook is to get you comfortable working with it.

To that end, before moving on to the next notebook, spend a few minutes trying out the following:
- Create a new instance of `NemoServiceBaseModel`, this time choosing a different model.
- Compare and contrast the output from the model you choose with that of the models we've already set up.
- Try using some of the named arguments to `generate`, like `tokens_to_generate`, `stop`, `top_k`, and `temperature`, to see how they affect model generation.

### Your Work Here