# Lab 2: Exploring Ollama

## Overview
Many of our labs will require the use of an LLM. Rather than using an online hosted commercial LLM with all of the associated fees, we will use a containerized version of Ollama serving a 3 billion parameter model.

## Goals

 * Pull and serve the Llama 3 model (`llama3`) in an Ollama container.
 * Understand how to use the HTTP API to interact with the model.
 * Develop an understanding of how context drives chatbots.

## Estimated Time: 60 minutes

# <img src="../images/task.png" width=20 height=20> Task 2.1

## Pulling a Trained Model

When you issued the initial `docker compose up` command, several different services were started. One of those is the Jupyter system through which you are interacting with these exercises. Another set of services supports a *vector store* that we will be using later in the class. The last is a container that is hosting and serving Ollama.

Ollama is an open source project under the MIT license, designed to host and serve various open LLMs. In this course, we will make use of the `llama3` model, but feel free to experiment with any of them. We have chosen this particular model because it is small enough to fit within the constraints of the system requirements that were specified for this course.

When the container is first started, it is ready to work but has no model loaded. We need to interact with it a bit to instruct Ollama to `pull` the `llama3` model. To do this we will interact with the HTTP API that it provides on port 11434.

To begin with, we need to load some Python libraries so that we can issue HTTP API calls easily. Please import the `requests` and `json` libraries.

# <img src="../images/task.png" width=20 height=20> Task 2.2

Let's start by verifying that Ollama is reachable. Normally we would need to either have Ollama running locally, know its IP address, or know its fully qualified domain name. In our case, since we are running all of these containers together, we can take advantage of the automatic naming that containerization solutions provide.

Using the host name `ollama`, send an HTTP GET request to that host on port `11434` and examine the returned content. This can be done using the `requests.get()` method.
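
One possible shape for this check is sketched below. The helper name `check_ollama` and the 5-second `timeout` are illustrative assumptions, not part of the lab. The `session` argument is assumed to be a `requests.Session()` (or any object with a compatible `get()` method), which keeps the sketch runnable even without a live server.

```python
def check_ollama(session, base_url="http://ollama:11434"):
    """Send a GET to the Ollama root URL and return the body as text.

    `session` is assumed to behave like `requests.Session()`.
    """
    # A timeout keeps the cell from hanging if the host is unreachable.
    resp = session.get(base_url, timeout=5)
    return resp.text
```

In the notebook, this might be called as `print(check_ollama(requests.Session()))` after importing `requests`.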

# <img src="../images/task.png" width=20 height=20> Task 2.3

The `Ollama is running` response tells us that Ollama is up and running; however, there is still another step that must be taken. Ollama provides a platform that can download and serve a number of different models. While Ollama is running, we have not yet downloaded any models.

Let's prove this and demonstrate how to send data to the API. Since we are sending data, we must use an HTTP `POST` rather than the `GET` we just used. To send a `POST` we can use `requests.post()`. This will, however, require a few more arguments:

 * We must configure the *request headers* to specify the type of data we are sending. This should be a dictionary containing `{'Content-Type':'application/json'}`.
 * We must also send a JSON body. This can also be built as a Python dictionary with the following keys and values:
   - A `model` key with the value `llama3`, which is the model we wish to use. This parameter allows us to select the model used to generate responses.
   - A `prompt` key with the text or prompt we want completed. Let's use `What is 42?`.
   - A `stream` key with the value `False`. This key allows us to control whether the response is returned as a single response or streamed as individual tokens are generated by the model. For now, let's be patient and wait for the entire response.

Our request must also be sent to a different API endpoint. To ask a model to generate text, the URL we must use is `http://ollama:11434/api/generate`.

Please use the empty cell below to generate a `POST` request to the Ollama container. Use the `requests.post()` method, passing the URL, the headers, and the data. Pass the headers using the `headers` keyword argument and the data, serialized to a JSON string with `json.dumps()`, using the `data` keyword argument.
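
The pieces of this request can be assembled as in the sketch below. The names `GENERATE_URL`, `HEADERS`, and `build_generate_body` are illustrative, not part of the lab. The one detail worth noting is serialization: when `requests` is given a plain dictionary through `data=`, it form-encodes it rather than sending JSON, so the sketch converts the dictionary to a JSON string first.

```python
import json

GENERATE_URL = "http://ollama:11434/api/generate"
HEADERS = {"Content-Type": "application/json"}

def build_generate_body(prompt, model="llama3", stream=False):
    """Return the JSON text for an /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    # json.dumps converts Python's False into JSON's false.
    return json.dumps(payload)

body = build_generate_body("What is 42?")
print(body)
```

The request itself could then be sent with `requests.post(GENERATE_URL, headers=HEADERS, data=body)`.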

# <img src="../images/task.png" width=20 height=20> Task 2.4

As predicted, the server reports that we do not have a model loaded with the message, `b'{"error":"model \\"llama3\\" not found, try pulling it first"}'`.

To tell Ollama to pull the model, we must use the `/api/pull` API endpoint. To use this endpoint, we must configure the data that we send with the name of the model to pull.

Use the following cell to send a `POST` request to the `/api/pull` endpoint. This time, the `data` parameter should be configured as:

`data = {"name":"llama3", "stream":False}`
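
A minimal sketch of the pull request body, plus a way to check the reply, follows. The helper names are hypothetical. The sketch assumes that a non-streamed pull returns a single JSON object whose `status` field reads `success` when the download finishes, and that failures carry an `error` field like the one seen in the previous task.

```python
import json

PULL_URL = "http://ollama:11434/api/pull"

def build_pull_body(model="llama3"):
    """Return the JSON text asking Ollama to pull `model`."""
    return json.dumps({"name": model, "stream": False})

def pull_succeeded(content):
    """Return True when a non-streamed pull reply reports success.

    `content` is the raw response body (bytes or str).
    """
    reply = json.loads(content)
    # Error replies have an "error" key and no "status" key.
    return reply.get("status") == "success"
```

With these in place, the call might look like `pull_succeeded(requests.post(PULL_URL, data=build_pull_body()).content)`.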


# <img src="../images/task.png" width=20 height=20> Task 2.5

Running this cell will require some patience. If you watch the command line from which you ran `docker compose up`, you will see Ollama messages detailing the download progress. Depending on your Internet connection speed, the download could take several minutes. Once it completes, you should see a final status message indicating `"success"`.

With the model now downloaded, we should be able to send a query. Please resend the same request from **Task 2.3**, capturing the `.content` of the response in a variable named `response` (for example, `response = requests.post(URL).content`).

# <img src="../images/task.png" width=20 height=20> Task 2.6

This cell may take 30 seconds or more to run. When it completes, you will see the asterisk turn into a number, indicating completion, but you should not see any output since we captured the content of the result into a variable. Please execute the following cell to examine the content returned.

# <img src="../images/task.png" width=20 height=20> Task 2.7

The exact response that you receive may differ from the result shown above in the solutions notebook, because the model adds a bit of randomness when choosing each next token. Take a few moments to examine the response. You should be able to find the following things:

 * `model` key, indicating this result is from `llama3`.
 * `created_at` key, telling you when the response was generated.
 * `response` key, providing the complete response as a string.
 * `done` key, indicating that the response is complete.
 * `done_reason` key, telling us why the model stopped.
 * `context` key, providing a list of the token indices including the prompt and the response.
 * `total_duration` key, indicating the number of nanoseconds spent generating the response.
 * `load_duration` key, indicating the number of nanoseconds spent loading the model.
 * `prompt_eval_count` key, the number of tokens in the prompt.
 * `prompt_eval_duration` key, the time in nanoseconds spent evaluating the prompt.
 * `eval_count` key, indicating the total number of tokens sent in the response.
 * `eval_duration` key, detailing the number of nanoseconds spent generating the response.

To make the response easier to work with, let's convert it into a Python dictionary. This can be done using the `json.loads()` function.

Use the next cell to decode the `response` into a Python dictionary named `response`. Once this is done, print out the `'response'` key from this dictionary.
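
The decoding step can be tried offline with a stand-in body, as sketched below. The sample bytes are hypothetical and abbreviated, and every value in them is invented for illustration. The sketch uses the name `decoded` rather than reusing `response`, and it also shows one way to turn the nanosecond timings into a tokens-per-second figure.

```python
import json

# Hypothetical, abbreviated response body; every value here is invented.
sample_content = (
    b'{"model":"llama3","response":"42 is a number.","done":true,'
    b'"eval_count":6,"eval_duration":3000000000}'
)

# json.loads accepts bytes directly and returns a Python dictionary.
decoded = json.loads(sample_content)
print(decoded["response"])

# Durations are in nanoseconds, so divide by 1e9 to get seconds.
eval_seconds = decoded["eval_duration"] / 1e9
tokens_per_second = decoded["eval_count"] / eval_seconds
print(f"{tokens_per_second:.1f} tokens/s")
```

Substituting the real `response` bytes from the previous cell for `sample_content` gives the same view of an actual reply.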

# <img src="../images/task.png" width=20 height=20> Task 2.8

Wonderful! While the response is not fast, we are able to send queries and have them answered. What about speed?

We will make no attempt to speed up the model itself in this class. The model response seems slow for two reasons. First, we have not configured any GPUs in the system, nor have we attempted (nor will we) to install GPU driver support into Docker or Kubernetes. If your organization is planning to deploy this type of model, you should *definitely* investigate which GPUs make the most sense for your applications, your platforms (containerized or not), and your systems.

The second reason this seems so slow is that we do not see anything until the entire response has been generated. To improve our experience during class (and for any interactive chat app you might build), let's change how we're making the request.

The `"stream"` option in the JSON request, when set to `True`, will stream chunks of the response (tokens) as they become available. This sounds much more pleasant, but it requires a bit of a different approach in our Python code.

Please consider the Python code in the cell below and, when you have a good handle on what it is doing, execute the cell.

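```python
import json
import requests

def get_stream(url, data):
    """POST `data` to `url` and print each token as it is streamed back."""
    session = requests.Session()

    # stream=True lets requests hand us the body line by line as it arrives.
    with session.post(url, data=data, stream=True) as resp:
        for line in resp.iter_lines():
            if line:  # skip empty keep-alive lines
                # Each line is one JSON object; "response" holds the next token.
                token = json.loads(line)["response"]
                print(token, end='')

# "stream": True asks Ollama to send tokens as they are generated.
data = {"model":"llama3", "prompt": "Which LLM are you?", "stream":True}
url = 'http://ollama:11434/api/generate'

get_stream(url, json.dumps(data))
```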

# <img src="../images/task.png" width=20 height=20> Task 2.9

Wow, that's much better! It's still taking the model a while to generate the answer, but the delay is much more tolerable since we can see what it is doing. Before concluding this lab, let's investigate the `"context"` value and see how it can be used. First:

Using the next cell and the techniques above, ask the model, "Who was Macbeth?"

# <img src="../images/task.png" width=20 height=20> Task 2.10

That response seems completely reasonable. In the event you are looking at the solution while working through this on your own, do not be concerned if the response generated by your model is not identical. No doubt it includes the highlights; specifically, something about Macbeth being a fictional Shakespearean character based on a real historical figure.

Using the next cell and the techniques above, ask the model, "What did the witches say about him?"

# <img src="../images/task.png" width=20 height=20> Task 2.11

What happened? The model acts like it has no idea what we are talking about!

The problem is that every prompt that we send to the model is viewed as a completely discrete event. Unless we do something to remind the model about the history of our conversation, it will have no way to connect the second question to the first, resulting in a response that isn't particularly useful. This brings us to the `"context"` field.

The context is a list of token IDs that the model returns to us, encoding both the question and the response that the model generated. If we store this value from the response to our first question and then send it with our second question, the model will perform as we might expect. Let's try it.

 * Redefine the `get_stream()` function such that it returns the `'context'` array from the JSON object in the last part of the stream.
 * Capture this value in a variable.
 * Use this new function to re-send the question, "Who was Macbeth?"
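
The core of such a redefinition might look like the sketch below. The helper name `consume_stream` is hypothetical. It assumes, as described above, that the last JSON object in the stream has `done` set to true and carries the `context` array. It takes any iterable of JSON lines, so it can be exercised without a server.

```python
import json

def consume_stream(lines):
    """Print each streamed token and return the final chunk's context.

    `lines` is any iterable of JSON lines, such as `resp.iter_lines()`.
    Returns None if no finished chunk with a context was seen.
    """
    context = None
    for line in lines:
        if not line:
            continue  # skip empty keep-alive lines
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="")
        if chunk.get("done"):
            # Assumed: only the final chunk includes the context array.
            context = chunk.get("context")
    return context
```

Inside a revised `get_stream()`, the body of the `with` block could then reduce to `return consume_stream(resp.iter_lines())`.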

# <img src="../images/task.png" width=20 height=20> Task 2.12

Now that we have the initial answer and the context, we are ready to ask the second question. We just have to remember to send the context value in the `data` object.

 * Add a `"context"` key to the `data` dictionary with the context array that was returned in the last cell.
 * Ask the model, "What did the witches say about him?", sending the context in the data.
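
As a rough sketch, the follow-up request body could be built like this. The helper name `build_followup_body` is hypothetical, and the token IDs in `first_context` are placeholders standing in for the array returned by the previous request.

```python
import json

def build_followup_body(prompt, context, model="llama3"):
    """Return JSON text for a streamed request that carries prior context."""
    data = {"model": model, "prompt": prompt, "stream": True}
    if context:
        # The context array ties this prompt to the earlier exchange.
        data["context"] = context
    return json.dumps(data)

first_context = [101, 202, 303]  # placeholder token IDs, not real output
body = build_followup_body("What did the witches say about him?", first_context)
```

The resulting `body` string is what would be passed as the `data` argument of the streaming function.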

# Conclusion

In this lab we have accomplished some important things and learned some useful techniques:

 * We now have the Llama 3 model installed in our Ollama container.
 * We know how to interact with the API to pull models.
 * We know how to interact with the API to send questions.
 * We understand the function of the `stream` attribute and have code that allows us to receive and print out each part of the response as it arrives.
 * We understand how the `context` is returned and how it can be included in a subsequent query to continue the "conversation."