# Large Language Model Serving Tutorial with DigitalHub

This notebook demonstrates how to deploy and serve a pre-trained Large Language Model using KubeAI with the DigitalHub SDK. We'll work with a Llama model for text generation.


## Project Initialization

Initialize a DigitalHub project, using the same naming convention as the other tutorials.

In [None]:
import digitalhub as dh
import getpass as gt

USERNAME = gt.getuser()

project = dh.get_or_create_project(f"{USERNAME}-tutorial-project")
print(project.name)

## Step 1: Model Configuration

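We'll create a function to serve the Llama 3.2 model (1B parameters).
The model URL below uses the `ollama://` scheme together with the OLlama engine, so the model is referenced by name and tag (`llama3.2:1b`) and needs no manual download. KubeAI also accepts `hf://` URLs to reference models directly on the HuggingFace Hub.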

In [None]:
llm_function = project.new_function(
    name="llama32-1b",
    kind="kubeai-text",
    model_name=f"{USERNAME}-model",
    url="ollama://llama3.2:1b",
    engine='OLlama',
    features=['TextGeneration']
)

## Step 2: Model Serving

Now we'll deploy the model. We use a GPU profile (`1xa100`) to accelerate generation.

In [None]:
llm_run = llm_function.run("serve", profile="1xa100", wait=True)

Let's check that our service is running and ready to accept requests:

In [None]:
service = llm_run.refresh().status.service
print("Service status:", service)

Even once the service is ready, the model still has to be downloaded and deployed before it can answer requests. We can check its status:

In [None]:
status = llm_run.refresh().status.k8s.get("Model")['status']
print("Model status:", status)
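
Since the download can take a while, it may be convenient to poll the status instead of re-running the cell by hand. The following is a minimal sketch, not part of the DigitalHub SDK: `wait_for_status` is a hypothetical helper, and the readiness condition shown in the commented usage (a `replicas.ready` counter in the status) is an assumption about what the deployment reports.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], object],
    is_ready: Callable[[object], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
) -> object:
    """Poll fetch_status() until is_ready(status) is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if is_ready(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"model not ready after {timeout_s}s; last status: {status!r}"
            )
        time.sleep(interval_s)


# Hypothetical usage with the run created above. The shape of the status
# object and the readiness condition are assumptions; print the status once
# to see what your deployment actually reports.
# status = wait_for_status(
#     lambda: llm_run.refresh().status.k8s.get("Model")["status"],
#     lambda s: bool(s) and s.get("replicas", {}).get("ready", 0) > 0,
# )
```

Passing the fetch and readiness checks as callables keeps the helper independent of the exact status layout, so only the two lambdas need adjusting.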

We can check the logs of the main container if needed:

In [None]:
import base64
log = base64.b64decode(llm_run.refresh().logs()[0]["content"]).decode('utf-8')

In [None]:
print(log)
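
The cell above decodes only the first log entry. As a sketch, assuming every entry returned by `logs()` carries a base64-encoded `content` field like the first one does, a small hypothetical helper (`decode_log_entries`, not part of the SDK) can decode them all:

```python
import base64
from typing import Iterable, Mapping


def decode_log_entries(entries: Iterable[Mapping[str, str]]) -> list[str]:
    """Decode the base64 'content' field of each log entry into text."""
    decoded = []
    for entry in entries:
        raw = base64.b64decode(entry["content"])
        # errors="replace" keeps going if a container emits non-UTF-8 bytes.
        decoded.append(raw.decode("utf-8", errors="replace"))
    return decoded


# Hypothetical usage with the run created above:
# for text in decode_log_entries(llm_run.refresh().logs()):
#     print(text)
```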

### Test the LLM API

Now let's test our deployed model with a prompt.

In [None]:
model_name = llm_run.refresh().status.k8s.get("Model").get("metadata").get("name")
json_payload = {'model': model_name, 'prompt': 'how can i use a PAT with the DigitalHub?'}

In [None]:
import requests
import pprint
pp = pprint.PrettyPrinter(indent=2)

url = service['url']+'/v1/completions'

r = requests.post(url, json=json_payload)
print(f"Status Code: {r.status_code}")
pp.pprint(r.json())

The SDK exposes a helper method for invoking the service, eliminating the need for custom HTTP request handling.

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)

r = llm_run.invoke(json=json_payload, url=service['url']+'/v1/completions').json()
pp.pprint(r)


### Understanding the Results

The model returns the text that completes the prompt, along with usage information that can be used for monitoring or billing.
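
To pull those two pieces out of the response, the sketch below assumes the `/v1/completions` endpoint returns an OpenAI-style body with a `choices` list and a `usage` object. `summarize_completion` is a hypothetical helper, and the `example` payload is illustrative only, not real output from the deployed model.

```python
from typing import Any, Mapping


def summarize_completion(response: Mapping[str, Any]) -> dict:
    """Extract generated text and token counts from an OpenAI-style completions response."""
    choices = response.get("choices") or []
    text = choices[0].get("text", "") if choices else ""
    usage = response.get("usage") or {}
    return {
        "text": text.strip(),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Illustrative payload only, not real output from the deployed model.
example = {
    "choices": [{"text": " Hello there.", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8},
}
print(summarize_completion(example))
```

With the response `r` obtained earlier, `summarize_completion(r)` would return the same fields; missing keys come back as `None` or an empty string rather than raising.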

## Exercises

* Check that the model is usable via OpenWebUI.
* Check logs and metrics from the console.

