<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

> **Note:** This file has been modified from the original
> [NVIDIA AI Blueprint: Bring Your LLM to NIM](https://github.com/NVIDIA-AI-Blueprints/bring-llms-to-nim).
> Changes: Adapted for Red Hat AI Inference Server (RHAIIS) on OpenShift,
> replacing NVIDIA NIM Docker deployment with OpenShift Kubernetes resources.

# Deploy Hugging Face Models with RHAIIS on OpenShift

## What's Covered

This tutorial includes:
* **Example 1**: Basic deployment from Hugging Face (5 minutes)
* **Example 2**: Customizing model parameters
* **Example 3**: Deploying from local storage for offline use

## Setup: Python SDK with uv for IntelliJ

### Prerequisites Setup

This notebook uses `uv` for Python dependency management. Follow these steps to set up your environment:

#### 1. Initialize uv project
```bash
uv init --no-readme
```

#### 2. Add Jupyter dependencies to pyproject.toml
```bash
uv add jupyter notebook ipykernel ipywidgets
```

Or manually edit `pyproject.toml`:
```toml
dependencies = [
    "jupyter>=1.0.0",
    "notebook>=7.0.0",
    "ipykernel>=6.0.0",
    "ipywidgets>=8.0.0",
]
```

#### 3. Sync/install dependencies
```bash
uv sync
```

This creates a `.venv` virtual environment and installs all packages.

#### 4. Configure IntelliJ IDEA
1. Open **File → Project Structure → Project**
2. Click **SDK** → **Add SDK** → **Python SDK**
3. Select **Virtualenv Environment** → **Existing environment**
4. Browse to: `rhaiis-poc/.venv/bin/python`
5. Click **OK**

#### 5. Use the SDK in this notebook
1. In IntelliJ, select the Python interpreter (the one you just configured)
2. The notebook will use the Jupyter kernel from your `.venv`
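
To confirm that the notebook kernel really runs from the `.venv` that `uv sync` created, a quick check like the following can be run in a cell. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, not part of this project.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Running in a virtual environment:", in_virtualenv())
```

If the interpreter path does not end in `.venv/bin/python`, the SDK selected in IntelliJ is probably not the one configured in step 4.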

---


### Utility Functions

Below are some utility functions we'll use in this notebook. They simplify deploying and monitoring Red Hat AI Inference Server from a notebook environment, and aren't required in general.



In [7]:
import requests
import time

def check_service_ready(url):
    """Poll the service's /health endpoint until it returns HTTP 200"""
    url = f"http://{url}/health"
    print("Checking service health endpoint...")

    while True:
        try:
            # A per-request timeout keeps a hung connection from blocking the loop
            response = requests.get(url, headers={'accept': 'application/json'}, timeout=10)
            if response.status_code == 200:
                print("✓ Service ready!")
                break
        except requests.RequestException:
            # Connection errors and timeouts are expected while the pod starts
            pass
        print("⏳ Still starting...")
        time.sleep(30)

def generate_text(url, model, prompt, max_tokens=1000, temperature=0.7):
    """Generate text through the service's chat completions endpoint"""
    try:
        response = requests.post(
            f"http://{url}/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()['choices'][0]['message']['content']
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None

print("✓ Utility functions loaded successfully")

✓ Utility functions loaded successfully
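
`check_service_ready` polls until the endpoint answers, so it never returns if the deployment fails. One possible alternative is a bounded wait, sketched below. `wait_for` and its parameters are hypothetical names introduced for illustration; the `sleep` and `clock` arguments exist only so the helper can be exercised without real delays.

```python
import time
from typing import Callable

def wait_for(probe: Callable[[], bool],
             timeout_s: float = 900.0,
             interval_s: float = 30.0,
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses.

    Returns True on success and False when the deadline is reached.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A probe for this notebook could wrap the same `requests.get` call to `/health` that `check_service_ready` makes, returning `False` when a `requests.RequestException` is raised.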


## Deployment Examples

Let's explore different ways to deploy models using Red Hat AI Inference Server.

### Example 1: Basic Deployment from Hugging Face

This example shows how to deploy Llama-3.1-8B-Instruct with default settings directly from Hugging Face.


#### 1. Create the Secret custom resource (CR) for the Hugging Face token

The cluster uses the Secret CR to pull models from Hugging Face.

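1.1 Set the `HF_TOKEN` environment variable to the access token you created in Hugging Face. Use `%env` rather than `!`: each `!` command runs in its own subshell, so a variable assigned there is not visible to later cells.

In [None]:
%env HF_TOKEN=<your_huggingface_token>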

1.2 Set `NAMESPACE` to the namespace where you deployed the Red Hat AI Inference Server image (with `%env`, so that later `!oc` commands can read it), for example:

In [None]:
%env NAMESPACE=rhaiis-namespace

1.3 Create the Secret CR in the cluster:

In [None]:
!oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
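
For reference, `oc create secret generic --from-literal` stores each value base64-encoded under the Secret's `data` field. The sketch below builds an equivalent manifest in Python to make that structure visible. `build_secret_manifest` is a hypothetical helper and `hf_example_token` is a placeholder value, not a real token.

```python
import base64
import json

def build_secret_manifest(name: str, namespace: str, literals: dict) -> dict:
    """Build an Opaque Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }

# Placeholder token for illustration only.
manifest = build_secret_manifest(
    "hf-secret", "rhaiis-namespace", {"HF_TOKEN": "hf_example_token"}
)
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the token.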

#### 2. Create the Docker secret

The cluster uses this secret to download the Red Hat AI Inference Server image from the container registry. For example, to create a Secret CR that contains the contents of your local `~/.docker/config.json` file, run the following command:

In [None]:
!oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace

#### 3. Create a PersistentVolumeClaim (PVC)

Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The PVC is where the downloaded models are stored.

In [None]:
!oc apply -f ./1_basic/pvc.yaml -n rhaiis-namespace

#### 4. Create a Deployment custom resource (CR) that pulls the model from Hugging Face and deploys the Red Hat AI Inference Server container.

In [None]:
!oc apply -f ./1_basic/deployment.yaml -n rhaiis-namespace

#### 5. Create a Service CR for the model inference. For example:

In [None]:
!oc apply -f ./1_basic/service.yaml -n rhaiis-namespace

#### 6. Create a Route CR to enable public access to the model. For example:

In [None]:
!oc apply -f ./1_basic/route.yaml -n rhaiis-namespace

#### 7. Now let's test the deployed model:

Check if service is ready:

In [9]:
endpoint = !oc get route llama-3-1-8b-instruct -n rhaiis-namespace -o jsonpath='{.spec.host}'

# The `!` capture returns a list of output lines; the route host is the first element
print(endpoint[0])

check_service_ready(url=endpoint[0])


Logged into "https://api.ai-dev06.kni.syseng.devcluster.openshift.com:6443" as "xiezhang@redhat.com" using the token provided.

You have access to 90 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".
llama-3-1-8b-instruct-xieshen-rhaiis.apps.ai-dev06.kni.syseng.devcluster.openshift.com
Checking service health endpoint...
✓ Service ready!
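
The `endpoint = !oc ...` syntax only works in IPython and Jupyter. Outside a notebook, the same lookup could be done with `subprocess`, roughly as sketched here. `route_host_command` and `get_route_host` are hypothetical helper names; the sketch assumes `oc` is on the `PATH` and already logged in to the cluster.

```python
import subprocess

def route_host_command(route: str, namespace: str) -> list:
    """Build the `oc` argument list that prints a Route's host name."""
    return ["oc", "get", "route", route, "-n", namespace,
            "-o", "jsonpath={.spec.host}"]

def get_route_host(route: str, namespace: str) -> str:
    """Run the lookup and return the host; raises if `oc` exits non-zero."""
    result = subprocess.run(route_host_command(route, namespace),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Show the command that would be executed, without calling the cluster.
print(" ".join(route_host_command("llama-3-1-8b-instruct", "rhaiis-namespace")))
```

Passing the arguments as a list avoids shell quoting issues with the braces in the `jsonpath` expression.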


Test the Service:

In [10]:
result = generate_text(
    url=endpoint[0],
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write a complete function that computes fibonacci numbers in Rust"
)
print(result if result else "Failed to generate text")

**Fibonacci Function in Rust**

Here is a simple function that computes Fibonacci numbers in Rust. This function uses recursion, which is a common approach for calculating Fibonacci numbers. However, please note that recursion can lead to a stack overflow for large inputs.

```rust
/// Calculate the nth Fibonacci number using recursion.
fn fibonacci_recursive(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
    }
}

/// Calculate the nth Fibonacci number using iteration.
fn fibonacci_iterative(n: u32) -> u32 {
    if n <= 1 {
        return n;
    }

    let mut a = 0;
    let mut b = 1;
    for _ in 2..=n {
        let temp = a + b;
        a = b;
        b = temp;
    }
    b
}

fn main() {
    let n = 10; // Change this to the desired Fibonacci number
    println!("Fibonacci number at index {} (recursive): {}", n, fibonacci_recursive(n));
    println!("Fibonacci number at index {} (iterative): {}", n
```

### Example 2: Customizing Model Parameters

This example demonstrates how custom parameters affect model behavior. We'll deploy with specific constraints and test them:

**Key Parameters:**
* `--tensor-parallel-size=1`: Runs the model on a single GPU, without splitting it across devices
* `--max-model-len=2048`: Caps the context length (prompt plus generated tokens) at 2048 tokens
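
With a 2048-token context limit, the prompt and the requested `max_tokens` have to fit in the same budget, and servers built on vLLM generally reject a request that exceeds it. The arithmetic can be sketched as follows; `max_prompt_tokens` and `fits_context` are hypothetical helpers, and real prompt token counts (including any chat-template tokens) would have to come from the model's tokenizer.

```python
def max_prompt_tokens(max_model_len: int, max_tokens: int) -> int:
    """Tokens left for the prompt once max_tokens are reserved for output."""
    if max_tokens >= max_model_len:
        raise ValueError("max_tokens must be smaller than max_model_len")
    return max_model_len - max_tokens

def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Check that prompt plus requested output fits in the context window."""
    return prompt_tokens + max_tokens <= max_model_len

# generate_text defaults to max_tokens=1000, so with --max-model-len=2048
# the prompt can use at most this many tokens:
print(max_prompt_tokens(2048, 1000))  # 1048
```

For longer prompts, passing a smaller `max_tokens` to `generate_text` leaves more room within the same limit.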

#### 1. Update the Deployment custom resource (CR) with model parameters.

In [None]:
!oc apply -f ./2_custom_params/deployment.yaml -n rhaiis-namespace

#### 2. Now let's test the deployed model:

Check if service is ready:

In [12]:
endpoint = !oc get route llama-3-1-8b-instruct -n rhaiis-namespace -o jsonpath='{.spec.host}'

# The `!` capture returns a list of output lines; the route host is the first element
print(endpoint[0])

check_service_ready(url=endpoint[0])

llama-3-1-8b-instruct-xieshen-rhaiis.apps.ai-dev06.kni.syseng.devcluster.openshift.com
Checking service health endpoint...
✓ Service ready!


Test with custom parameters:

In [13]:
result = generate_text(
    url=endpoint[0],
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write me a function that computes fibonacci in Javascript"
)
print(result if result else "Failed to generate text")

**Fibonacci Function in JavaScript**

Here is a simple function that calculates the Fibonacci sequence in JavaScript. The Fibonacci sequence is a series of numbers where a number is the sum of the two preceding ones, usually starting with 0 and 1.

```javascript
/**
 * Calculates the nth Fibonacci number.
 *
 * @param {number} n - The position of the Fibonacci number to calculate.
 * @returns {number} The nth Fibonacci number.
 */
function fibonacci(n) {
  if (n <= 0) {
    throw new Error("n must be a positive integer");
  }

  if (n === 1) {
    return 0;
  }

  if (n === 2) {
    return 1;
  }

  let a = 0;
  let b = 1;
  let result = 0;

  for (let i = 3; i <= n; i++) {
    result = a + b;
    a = b;
    b = result;
  }

  return result;
}
```

**Example Use Cases**
---------------------

```javascript
console.log(fibonacci(1)); // 0
console.log(fibonacci(2)); // 1
console.log(fibonacci(3)); // 1
console.log(fibonacci(4)); // 2
console.log(fibonacci(5)); // 3
console.log(fibonacci(
```

### Example 3: Deployment from Local Model

This example shows how to deploy Qwen2.5-0.5B from a copy of the model that was downloaded to a PVC beforehand:

#### 1. Create a PersistentVolumeClaim (PVC)

Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The PVC is where the downloaded models are stored.

In [None]:
!oc apply -f ./3_local_model/pvc.yaml -n rhaiis-namespace

#### 2. Download Model to PVC

The download job fetches Qwen2.5-0.5B, a lightweight LLM, onto the PVC for the deployment in the next step.

In [None]:
!oc apply -f ./3_local_model/download-job.yaml -n rhaiis-namespace
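
The deployment in the next step expects the model files to be on the PVC already, so it helps to confirm that the download job has finished first. The sketch below interprets a Job object as returned by `oc get job <name> -o json`. The job name is defined in `download-job.yaml` and is not shown here, and `job_state` is a hypothetical helper.

```python
import json

def job_state(job: dict) -> str:
    """Return 'complete', 'failed', or 'running' from a Job's status conditions."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        # Only conditions whose status is the string "True" are in effect.
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "complete"
        if cond.get("type") == "Failed":
            return "failed"
    return "running"

# Sample payload shaped like the relevant part of a Job object.
sample = json.loads(
    '{"status": {"conditions": [{"type": "Complete", "status": "True"}]}}'
)
print(job_state(sample))  # complete
```

A Job that has not finished yet has no `Complete` or `Failed` condition, which the helper reports as `running`.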

#### 3. Create a Deployment custom resource (CR) that uses the downloaded model from the PVC and deploys the Red Hat AI Inference Server container.

In [None]:
!oc apply -f ./3_local_model/deployment.yaml -n rhaiis-namespace

#### 4. Create a Service CR for the model inference. For example:

In [None]:
!oc apply -f ./3_local_model/service.yaml -n rhaiis-namespace

#### 5. Create a Route CR to enable public access to the model. For example:

In [None]:
!oc apply -f ./3_local_model/route.yaml -n rhaiis-namespace

#### 6. Now let's test the deployed model:

In [14]:
endpoint = !oc get route qwen2-5-0-5b -n rhaiis-namespace -o jsonpath='{.spec.host}'

# The `!` capture returns a list of output lines; the route host is the first element
print(endpoint[0])

check_service_ready(url=endpoint[0])

qwen2-5-0-5b-xieshen-rhaiis.apps.ai-dev06.kni.syseng.devcluster.openshift.com
Checking service health endpoint...
✓ Service ready!


Test the local model deployment:

In [16]:
result = generate_text(
    url=endpoint[0],
    model="Qwen/Qwen2.5-0.5B",
    prompt="Tell me a story about a cat")
print(result if result else "Failed to generate text")

Once upon a time, there was a cat named Spotty. She was a very smart and funny animal who loved to play in the sun and chase after a ball across the grass. One day, Spotty’s owner, Mr. Baker, decided to take her for a walk in the park. As they walked, Spotty noticed a group of rabbits playing by the river. She decided to join them and soon found out that they were having a great time together. They played a game of fetch and Spotty even got to catch the rabbits’ ball in the process. After the walk, Spotty and Mr. Baker invited the rabbits to a picnic. They sat on a bench and enjoyed a delicious meal of carrots and apples. They even talked about their day and how they had a great time together. That was the moment when Spotty realized just how much they had in common and that they would always be friends.
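
`generate_text` returns only the message content, so a reply that stopped at the token limit looks the same as a complete one. OpenAI-compatible chat completion responses include a `finish_reason` for each choice, where `"length"` indicates that generation hit the token limit. A minimal sketch of checking it follows; `extract_reply` is a hypothetical helper and the sample payload is hand-written for illustration, not captured from the service.

```python
def extract_reply(payload: dict) -> tuple:
    """Return (text, truncated) from a chat completion response body."""
    choice = payload["choices"][0]
    text = choice["message"]["content"]
    # "length" means the model stopped because it ran out of tokens.
    truncated = choice.get("finish_reason") == "length"
    return text, truncated

# Hand-written sample shaped like a chat completion response.
sample_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Once upon a time"},
            "finish_reason": "length",
        }
    ]
}
text, truncated = extract_reply(sample_response)
print(text, "(truncated)" if truncated else "")
```

In `generate_text`, the same check could be applied to `response.json()` before returning the content.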
