# Deploy Hugging Face Models with RHAIIS on OpenShift

## What's Covered

This tutorial includes:
* **Example 1**: Basic deployment from Hugging Face (5 minutes)
* **Example 2**: Deploying with different backend options
* **Example 3**: Customizing model parameters
* **Example 4**: Deploying from local storage for offline use


## Setup: Python SDK with uv for IntelliJ

### Prerequisites Setup

This notebook uses `uv` for Python dependency management. Follow these steps to set up your environment:

#### 1. Initialize uv project
```bash
uv init --no-readme
```

#### 2. Add Jupyter dependencies to pyproject.toml
```bash
uv add jupyter notebook ipykernel ipywidgets
```

Or manually edit `pyproject.toml`:
```toml
dependencies = [
    "jupyter>=1.0.0",
    "notebook>=7.0.0",
    "ipykernel>=6.0.0",
    "ipywidgets>=8.0.0",
]
```

#### 3. Sync/install dependencies
```bash
uv sync
```

This creates a `.venv` virtual environment and installs all packages.

#### 4. Configure IntelliJ IDEA
1. Open **File → Project Structure → Project**
2. Click **SDK** → **Add SDK** → **Python SDK**
3. Select **Virtualenv Environment** → **Existing environment**
4. Browse to: `rhaiis-poc/.venv/bin/python`
5. Click **OK**

#### 5. Use the SDK in this notebook
1. In IntelliJ, select the Python interpreter (the one you just configured)
2. The notebook will use the Jupyter kernel from your `.venv`
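
To confirm that the notebook kernel is really running from the project's `.venv`, a quick check along these lines can help. This is a minimal sketch that uses only the standard library; the helper name `describe_interpreter` is made up for illustration.

```python
import sys


def describe_interpreter():
    """Report which Python executable is running and whether it is a virtual environment."""
    return {
        "executable": sys.executable,
        # Inside a venv, sys.prefix points at the venv while sys.base_prefix
        # points at the base installation, so the two differ.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


info = describe_interpreter()
print(f"Interpreter: {info['executable']}")
print(f"Running inside a virtual environment: {info['in_virtualenv']}")
```

If the printed path does not end in `.venv/bin/python`, the notebook is probably still attached to a different interpreter.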

---


### Utility Functions

Below are some utility functions we'll use in this notebook. They simplify deploying and monitoring model services in a notebook environment and aren't required in general.



In [7]:
import requests
import time

def check_service_ready(url):
    """Poll the service's /health endpoint until it returns HTTP 200"""
    url = f"http://{url}/health"
    print("Checking service health endpoint...")

    while True:
        try:
            # The timeout keeps a hung connection from blocking the loop forever
            response = requests.get(url, headers={'accept': 'application/json'}, timeout=10)
            if response.status_code == 200:
                print("✓ Service ready!")
                break
        except requests.exceptions.RequestException:
            # Service not reachable yet (connection refused, timeout, ...); keep polling
            pass
        print("⏳ Still starting...")
        time.sleep(30)

def generate_text(url, model, prompt, max_tokens=1000, temperature=0.7):
    """Generate text through the service's /v1/chat/completions endpoint"""
    try:
        response = requests.post(
            f"http://{url}/v1/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": temperature
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()['choices'][0]['message']['content']
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None

print("✓ Utility functions loaded successfully")

✓ Utility functions loaded successfully
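
Later cells call a `check_service_ready_from_logs` helper that is not defined in this notebook. The block below is one possible sketch of such a helper, not the original implementation. It assumes the container prints a uvicorn-style `Application startup complete` line once it is serving, and the `docker_logs`, `ready_marker`, `fetch_logs`, `timeout`, and `poll_interval` names are hypothetical.

```python
import subprocess
import time


def docker_logs(container_name):
    """Return the combined stdout/stderr of `docker logs` for a container."""
    result = subprocess.run(
        ["docker", "logs", container_name],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


def check_service_ready_from_logs(
    container_name,
    print_logs=False,
    timeout=600,
    poll_interval=5,
    ready_marker="Application startup complete",  # assumption: startup line to look for
    fetch_logs=docker_logs,
):
    """Poll container logs until `ready_marker` appears or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    seen = 0  # number of log characters already printed
    while True:
        logs = fetch_logs(container_name)
        if print_logs and len(logs) > seen:
            print(logs[seen:], end="")
            seen = len(logs)
        if ready_marker in logs:
            print("✓ Service ready!")
            return True
        if time.monotonic() >= deadline:
            print("✗ Timed out waiting for the service to start")
            return False
        time.sleep(poll_interval)
```

Passing the log reader in as `fetch_logs` keeps the polling logic testable without a running container. Unlike `check_service_ready`, this variant gives up after `timeout` seconds instead of looping forever.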


## Deployment Examples

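The examples below show different ways to deploy models: Example 1 uses Red Hat AI Inference Server on OpenShift, and Examples 2 to 4 run NIM containers with Docker.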

### Example 1: Basic Deployment from Hugging Face

This example shows how to deploy Llama-3.1-8B-Instruct with default settings directly from Hugging Face.


#### 1. Create the Secret custom resource (CR) for the Hugging Face token. The cluster uses the Secret CR to pull models from Hugging Face.

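1.1 Set the `HF_TOKEN` environment variable to the access token you created in Hugging Face. Each `!` command runs in its own subshell, so a variable assigned with `!HF_TOKEN=...` would be lost immediately; setting it through `os.environ` keeps it available to later `!` commands.

In [None]:
import os
os.environ["HF_TOKEN"] = "<your_huggingface_token>"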

1.2 Set the `NAMESPACE` environment variable to the namespace where you deployed the Red Hat AI Inference Server image, for example:

In [None]:
import os
os.environ["NAMESPACE"] = "rhaiis-namespace"

1.3 Create the Secret CR in the cluster:

In [None]:
!oc create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN -n $NAMESPACE
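
For reference, the command above creates an `Opaque` Secret whose values are stored base64-encoded. The sketch below builds an equivalent manifest in Python; the `build_opaque_secret` helper is hypothetical, and the name, namespace, and key simply mirror the command.

```python
import base64
import json


def build_opaque_secret(name, namespace, literals):
    """Build a Secret manifest like the one `oc create secret generic` produces."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under `data` must be base64-encoded strings.
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in literals.items()
        },
    }


secret = build_opaque_secret(
    name="hf-secret",
    namespace="rhaiis-namespace",
    literals={"HF_TOKEN": "<your_huggingface_token>"},
)
print(json.dumps(secret, indent=2))
```

Base64 is an encoding, not encryption, so treat the printed manifest as sensitive once it contains a real token.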

#### 2. Create the Docker secret so that the cluster can download the Red Hat AI Inference Server image from the container registry. For example, to create a Secret CR that contains the contents of your local ~/.docker/config.json file, run the following command:

In [None]:
!oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace

#### 3. Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR uses a default IBM VPC Block persistent volume. You use the PVC as the location where you store the models that you download.

In [None]:
!oc apply -f ./1_basic/pvc.yaml -n rhaiis-namespace
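
The contents of `1_basic/pvc.yaml` are not reproduced here. As a rough illustration of what such a claim might contain, the sketch below builds a PVC manifest in Python. The claim name `model-cache`, the `50Gi` size, and the `build_pvc` helper are assumptions for illustration, not values taken from the actual file.

```python
import json


def build_pvc(name, namespace, size, storage_class=None):
    """Build a PersistentVolumeClaim manifest for storing downloaded models."""
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }
    # Leaving storageClassName out lets the cluster pick its default class.
    if storage_class is not None:
        pvc["spec"]["storageClassName"] = storage_class
    return pvc


# Hypothetical values: adjust the name and size to your model and cluster.
print(json.dumps(build_pvc("model-cache", "rhaiis-namespace", "50Gi"), indent=2))
```

Size the claim to hold the model weights with some headroom; an 8B-parameter model in 16-bit precision needs roughly 16 GB for weights alone.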

#### 4. Create a Deployment custom resource (CR) that pulls the model from Hugging Face and deploys the Red Hat AI Inference Server container. The following example Deployment CR uses AI Inference Server to serve the Llama-3.1-8B-Instruct model on a CUDA accelerator.

In [None]:
!oc apply -f ./1_basic/deployment.yaml -n rhaiis-namespace

#### 5. Create a Service CR for the model inference. For example:

In [None]:
!oc apply -f ./1_basic/service.yaml -n rhaiis-namespace

#### 6. Create a Route CR to enable public access to the model. For example:

In [None]:
!oc apply -f ./1_basic/route.yaml -n rhaiis-namespace
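
The Service and Route files are not shown in this notebook. The sketch below illustrates how the two resources typically relate: the Service selects the model pods, and the Route points at the Service. The `app` label, the port `8000`, and both helper names are assumptions; only the route name `llama-3-1-8b-instruct` comes from the `oc get route` command used in the next step.

```python
import json


def build_service(name, namespace, app_label, port=8000):
    """Build a Service that forwards traffic to pods labelled app=<app_label>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"app": app_label},
            "ports": [
                {"name": "http", "protocol": "TCP", "port": port, "targetPort": port}
            ],
        },
    }


def build_route(name, namespace, service_name, port_name="http"):
    """Build an OpenShift Route that exposes a Service port outside the cluster."""
    return {
        "apiVersion": "route.openshift.io/v1",
        "kind": "Route",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "to": {"kind": "Service", "name": service_name},
            "port": {"targetPort": port_name},
        },
    }


name = "llama-3-1-8b-instruct"
print(json.dumps(build_service(name, "rhaiis-namespace", app_label=name), indent=2))
print(json.dumps(build_route(name, "rhaiis-namespace", service_name=name), indent=2))
```

A Route without a `tls` section serves plain HTTP, which matches the `http://` URLs built by the utility functions above.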

#### 7. Now let's test the deployed model:

Check if service is ready:

In [9]:
endpoint = !oc get route llama-3-1-8b-instruct -n rhaiis-namespace -o jsonpath='{.spec.host}'

# The ! capture returns a list of output lines; the route host is the first element
print(endpoint[0])

check_service_ready(url=endpoint[0])


llama-3-1-8b-instruct-xieshen-rhaiis.apps.ai-dev06.kni.syseng.devcluster.openshift.com
Checking service health endpoint...
✓ Service ready!


Test the Service:

In [10]:
result = generate_text(
    url=endpoint[0],
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Write a complete function that computes fibonacci numbers in Rust"
)
print(result if result else "Failed to generate text")

**Fibonacci Function in Rust**

Here is a simple function that computes Fibonacci numbers in Rust. This function uses recursion, which is a common approach for calculating Fibonacci numbers. However, please note that recursion can lead to a stack overflow for large inputs.

```rust
/// Calculate the nth Fibonacci number using recursion.
fn fibonacci_recursive(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
    }
}

/// Calculate the nth Fibonacci number using iteration.
fn fibonacci_iterative(n: u32) -> u32 {
    if n <= 1 {
        return n;
    }

    let mut a = 0;
    let mut b = 1;
    for _ in 2..=n {
        let temp = a + b;
        a = b;
        b = temp;
    }
    b
}

fn main() {
    let n = 10; // Change this to the desired Fibonacci number
    println!("Fibonacci number at index {} (recursive): {}", n, fibonacci_recursive(n));
    println!("Fibonacci number at index {} (iterative): {}", n
```

[output truncated]

### Example 2: Deployment Using Different Backend Options

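NIM supports multiple backends for model deployment. Let's explore the TensorRT-LLM and vLLM backends. The cells below assume that the `CONTAINER_NAME`, `NIM_IMAGE`, `LOCAL_NIM_CACHE`, and `HF_TOKEN` environment variables are already set, that `os` has been imported, and that a `check_service_ready_from_logs` helper is defined.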

#### TensorRT-LLM Backend




In [None]:
# Using TensorRT-LLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="tensorrt_llm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

Test the TensorRT-LLM backend:




In [None]:
result = generate_text(
    url="localhost:8000",  # the container publishes port 8000 on the local host
    model="mistralai/Codestral-22B-v0.1",
    prompt="Write a complete Python function that computes fibonacci numbers with memoization"
)
print("TensorRT-LLM Backend Result:")
print("=" * 50)
print(result if result else "Failed to generate text")

Before we move on to the next backend, let's stop the LLM NIM service.





In [None]:
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"

#### vLLM Backend




In [None]:
# Using vLLM backend by specifying the NIM_MODEL_PROFILE parameter
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_MODEL_PROFILE="vllm" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

Test the vLLM backend:





In [None]:
result = generate_text(
    url="localhost:8000",  # the container publishes port 8000 on the local host
    model="mistralai/Codestral-22B-v0.1",
    prompt="Write a complete C++ function that computes fibonacci numbers efficiently"
)
print("vLLM Backend Result:")
print("=" * 50)
print(result if result else "Failed to generate text")

Before we move on to the next example, let's stop the LLM NIM service.





In [None]:
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"
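
The two `docker run` commands in this example differ only in the `NIM_MODEL_PROFILE` value. A small helper can make that explicit. This is a sketch under stated assumptions: `build_nim_run_command` is a hypothetical name, the image and cache path in the usage lines are placeholders, and the `-it` and `-u` flags from the cells above are left out for brevity.

```python
import shlex


def build_nim_run_command(image, container_name, model, cache_dir, profile=None, port=8000):
    """Assemble a `docker run` argument list like the ones used in this example."""
    env = {
        "NIM_MODEL_NAME": f"hf://{model}",
        "NIM_SERVED_MODEL_NAME": model,
    }
    if profile is not None:
        env["NIM_MODEL_PROFILE"] = profile

    command = [
        "docker", "run", "--rm", "-d",
        f"--name={container_name}",
        "--runtime=nvidia",
        "--gpus", "all",
        "--shm-size=16GB",
        # `-e NAME` without a value forwards NAME from the caller's environment.
        "-e", "HF_TOKEN",
    ]
    for key, value in env.items():
        command += ["-e", f"{key}={value}"]
    command += ["-v", f"{cache_dir}:/opt/nim/.cache", "-p", f"{port}:8000", image]
    return command


# Placeholder image and cache directory; substitute your own values.
for profile in ("tensorrt_llm", "vllm"):
    cmd = build_nim_run_command(
        image="<nim-image>",
        container_name="nim-demo",
        model="mistralai/Codestral-22B-v0.1",
        cache_dir="/tmp/nim-cache",
        profile=profile,
    )
    print(shlex.join(cmd))
```

Returning a list instead of a string means the command can be handed to `subprocess.run` directly, without shell quoting concerns.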

### Example 3: Customizing Model Parameters

This example demonstrates how custom parameters affect model behavior. We'll deploy with specific constraints and test them:

**Key Parameters:**
* `NIM_TENSOR_PARALLEL_SIZE=2`: Shards the model across 2 GPUs (tensor parallelism) for better performance
* `NIM_MAX_INPUT_LENGTH=2048`: Limits input to 2048 tokens
* `NIM_MAX_OUTPUT_LENGTH=512`: Limits output to 512 tokens

<div class="alert alert-block alert-info">
    <b>Note:</b> You must have at least 2 GPUs to run the following cell. If you don't have at least 2 GPUs, modify the <code>NIM_TENSOR_PARALLEL_SIZE</code> parameter in the cell below.
</div>
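
One way to apply the note above is to clamp the tensor-parallel size to the number of GPUs actually available before building the container environment. The helper below is a hypothetical sketch; it takes the GPU count as an input rather than detecting it, and uses the parameter values from this example as defaults.

```python
def build_parameter_env(available_gpus, requested_tp=2, max_input=2048, max_output=512):
    """Return the NIM_* settings for this example, clamped to the available GPUs."""
    if available_gpus < 1:
        raise ValueError("at least one GPU is required")
    # Never request more tensor-parallel shards than there are GPUs.
    tensor_parallel = min(requested_tp, available_gpus)
    return {
        "NIM_TENSOR_PARALLEL_SIZE": str(tensor_parallel),
        "NIM_MAX_INPUT_LENGTH": str(max_input),
        "NIM_MAX_OUTPUT_LENGTH": str(max_output),
    }


# On a single-GPU machine the tensor-parallel size falls back to 1.
for key, value in build_parameter_env(available_gpus=1).items():
    print(f"-e {key}={value}")
```

With these limits, a request can use at most 2048 input tokens and produce at most 512 output tokens.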


In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e HF_TOKEN=$HF_TOKEN \
 -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
 -e NIM_SERVED_MODEL_NAME="mistralai/Codestral-22B-v0.1" \
 -e NIM_TENSOR_PARALLEL_SIZE=2 \
 -e NIM_MAX_INPUT_LENGTH=2048 \
 -e NIM_MAX_OUTPUT_LENGTH=512 \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

Test with custom parameters:





In [None]:
result = generate_text(url="localhost:8000",  # the container publishes port 8000 on the local host
                       model="mistralai/Codestral-22B-v0.1",
                       prompt="Write me a function that computes fibonacci in Javascript")
print(result if result else "Failed to generate text")

Before we move on to the next example, let's stop the LLM NIM service.





In [None]:
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"

### Example 4: Deployment from Local Model

This example shows how to deploy Qwen2.5-0.5B-Instruct from a locally downloaded copy of the model:

#### Download Model to Local Storage

We'll download Qwen2.5-0.5B-Instruct, a lightweight LLM, for use in Example 4.

<div class="alert alert-block alert-info">
<b>Note:</b> You can modify the <code>model_save_location</code> variable below to use a different directory for storing downloaded models.
</div>




In [None]:
# Set up local model directory
model_save_location = os.path.join(base_work_dir, "models")
local_model_name = "Qwen2.5-0.5B-Instruct"
local_model_path = os.path.join(model_save_location, local_model_name)
os.makedirs(local_model_path, exist_ok=True)

os.environ["LOCAL_MODEL_DIR"] = local_model_path

In [None]:
!huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir "$LOCAL_MODEL_DIR" && echo "✓ Model downloaded successfully"

In [None]:
# Verify model files exist
!ls -Rlh "$LOCAL_MODEL_DIR"
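
Beyond listing the directory, a programmatic check can catch an incomplete download before the container starts. The sketch below assumes a typical Hugging Face layout with a `config.json` and at least one `*.safetensors` or `*.bin` weight file; the `missing_model_files` helper is hypothetical.

```python
import os
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",),
                        weight_patterns=("*.safetensors", "*.bin")):
    """Return a list describing expected model files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in required if not (root / name).is_file()]
    # At least one weight file matching any of the patterns must be present.
    if not any(any(root.glob(pattern)) for pattern in weight_patterns):
        missing.append("weights (" + " or ".join(weight_patterns) + ")")
    return missing


# Falls back to the current directory when LOCAL_MODEL_DIR is not set.
problems = missing_model_files(os.environ.get("LOCAL_MODEL_DIR", "."))
print("✓ Model files look complete" if not problems else f"Missing: {problems}")
```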

In [None]:
!docker run -it --rm \
 --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus '"device=0"' \
 --shm-size=16GB \
 -e NIM_MODEL_NAME="/opt/models/local_model" \
 -e NIM_SERVED_MODEL_NAME="Qwen/Qwen2.5-0.5B" \
 -v "$LOCAL_MODEL_DIR:/opt/models/local_model" \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 -d \
 $NIM_IMAGE

In [None]:
# Use the log-based check (set print_logs=True to see detailed logs)
check_service_ready_from_logs(os.environ["CONTAINER_NAME"], print_logs=True)

Test the local model deployment:





In [None]:
result = generate_text(url="localhost:8000",  # the container publishes port 8000 on the local host
                       model="Qwen/Qwen2.5-0.5B",
                       prompt="Tell me a story about a cat")
print(result if result else "Failed to generate text")

In [None]:
# Final cleanup
!docker stop $CONTAINER_NAME 2>/dev/null || echo "Container already stopped"
print("✓ All containers stopped successfully")