# Part 2: Local NIM Deployment

This notebook will guide you through deploying NVIDIA NIMs locally on your own infrastructure.

## Prerequisites

- NVIDIA GPU
- Docker installed
- NGC API Key

## 1. Environment Check

### Ensure GPU Availability

In [1]:
# Check GPU availability
!nvidia-smi

Wed Jul 16 18:04:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             74W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
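The table above can also be read programmatically, which is handy for checking free GPU memory before starting a container. The following is a minimal sketch, assuming the `--query-gpu` CSV mode of `nvidia-smi`; the helper names `parse_gpu_csv` and `query_gpus` are illustrative and not part of the workshop code.

```python
import subprocess

QUERY_FIELDS = "name,memory.total,memory.used"

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts (values in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not break parsing
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus

def query_gpus():
    """Run nvidia-smi and return one dict per GPU; raises CalledProcessError on failure."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Sample line mirroring the A100 shown above, so the sketch runs without a GPU
sample = "NVIDIA A100-SXM4-40GB, 40960, 1\n"
for gpu in parse_gpu_csv(sample):
    free_mib = gpu["total_mib"] - gpu["used_mib"]
    print(f"{gpu['name']}: {free_mib} MiB free of {gpu['total_mib']} MiB")
```

On a machine with a GPU, replace `parse_gpu_csv(sample)` with `query_gpus()`.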
                                                

The next cell verifies that Docker can access the NVIDIA GPUs on the system by running `nvidia-smi` inside a container with the NVIDIA runtime.

In [2]:
# Verify Docker and NVIDIA runtime
!docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
12e3802c: Pull complete
Digest: sha256:c4570d2f4665d5d118ae29fb494dee4f8db8fcfaee0e37a2e19b827f399070d3
Status: Downloaded newer image for ubuntu:latest
Wed Jul 16 18:04:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  |   00000

### Cache directory for NIM

We're creating a cache directory. This is important - models are LARGE (5-100GB). The cache means:
- Download once, use many times
- Survive container restarts
- Share models between containers
- Quick model switching

The cache will be at `~/.cache/nim`.

In [6]:
import os
import subprocess
import time
import requests
import json

# Set cache directory
LOCAL_NIM_CACHE = os.path.expanduser("~/.cache/nim")
os.makedirs(LOCAL_NIM_CACHE, exist_ok=True)
print(f"NIM cache directory: {LOCAL_NIM_CACHE}")

NIM cache directory: /root/.cache/nim
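Because models are large, it can help to check how much the cache already holds and how much disk space is left before pulling another model. This is a small sketch using only the standard library; `dir_size_bytes`, `cache_report`, and the `required_gb` threshold are illustrative assumptions, not part of NIM.

```python
import os
import shutil

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def cache_report(path, required_gb=100):
    """Report cache size and free disk space; required_gb is an illustrative threshold."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "cached_gb": dir_size_bytes(path) / 1e9,
        "free_gb": free_gb,
        "enough_space": free_gb >= required_gb,
    }

cache_dir = os.path.expanduser("~/.cache/nim")
if os.path.isdir(cache_dir):
    print(cache_report(cache_dir))
```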


### Load API Keys

Load both the NGC and NVIDIA API keys from the `.env` file.

In [9]:
import os
import requests
import json
from openai import OpenAI
from dotenv import load_dotenv
from pathlib import Path

# Find the .env file in the project root
env_path = Path('.env')

# Load environment variables from .env file
# Use override=True to ensure values are loaded even if they exist in environment
load_dotenv(dotenv_path=env_path, override=True)

# Get API keys from environment
nvidia_api_key = os.getenv("NVIDIA_API_KEY")
ngc_api_key = os.getenv("NGC_API_KEY")

# Check NVIDIA API Key
if not nvidia_api_key:
    print("❌ NVIDIA API Key not found in .env file!")
    print("👉 Please run 00_Workshop_Setup.ipynb first to set up your API key.")
    print(f"   (Looked for .env file at: {env_path.absolute()})")
    raise ValueError("NVIDIA_API_KEY not found. Please run the setup notebook first.")
else:
    print("✅ NVIDIA API Key loaded successfully from .env file")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

# Check NGC API Keys
if not ngc_api_key:
    print("❌ NGC API Key not found in .env file!")
    print("👉 Please run 00_Workshop_Setup.ipynb first to set up your NGC API key.")
    raise ValueError("NGC_API_KEY not found. Please run the setup notebook first.")
else:
    print("✅ NGC API Key loaded successfully from .env file")
    os.environ["NGC_API_KEY"] = ngc_api_key

✅ NVIDIA API Key loaded successfully from .env file
✅ NGC API Key loaded successfully from .env file
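Notebook outputs are often shared, so when you want to confirm which key was loaded it is safer to print a masked form than the key itself. The helper below is a hypothetical convenience (`mask_secret` is not part of the workshop code) that shows only the first and last few characters.

```python
import os

def mask_secret(value, visible=4):
    """Return a printable form of a secret that reveals only its ends."""
    if not value:
        return "<missing>"
    if len(value) <= visible * 2:
        # Too short to reveal anything safely
        return "*" * len(value)
    return f"{value[:visible]}...{value[-visible:]}"

print("NVIDIA_API_KEY:", mask_secret(os.getenv("NVIDIA_API_KEY")))
print("NGC_API_KEY:", mask_secret(os.getenv("NGC_API_KEY")))
```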


## 2. NGC Authentication

Log in to the NGC registry at `nvcr.io` so that Docker can pull NIM containers and other assets.

In [11]:
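# Docker login to NGC
# The key is sent on stdin, so it never appears in a shell command line.
# '$oauthtoken' is the literal username NGC expects.
result = subprocess.run(
    ["docker", "login", "nvcr.io", "--username", "$oauthtoken", "--password-stdin"],
    input=ngc_api_key, capture_output=True, text=True,
)
# Show stderr when stdout is empty, so a failed login is not silent
print("Login result:", result.stdout.strip() or result.stderr.strip())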

Login result: Login Succeeded



## 3. Deploy Your First NIM

Define the container name and image, then stop and remove any existing container with the same name.

In [13]:
# Define deployment parameters
CONTAINER_NAME = "llama3.1-8b-instruct" 
IMG_NAME = "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"

# Stop existing container if running
!docker stop {CONTAINER_NAME} 2>/dev/null || true
!docker rm {CONTAINER_NAME} 2>/dev/null || true

## Running the container

### Docker run


This cell deploys the NIM container with the Llama 3.1 8B Instruct model. It constructs and executes a Docker command that:

**Container Configuration:**
- Runs in detached mode (`-d`) for background operation
- Enables GPU access with NVIDIA runtime
- Allocates 16GB shared memory for PyTorch operations
- Mounts the local cache directory to persist downloaded models
- Maps port 8000 for API access
- Runs with the current user's permissions to avoid file permission issues

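**Key Variables:**
- `NGC_API_KEY`: Environment variable passed into the container (set from `ngc_api_key`) to authenticate with NVIDIA GPU Cloud and download the model
- `LOCAL_NIM_CACHE`: Python variable pointing to `~/.cache/nim`, mounted at `/opt/nim/.cache` for model storage

**Success/Failure Handling:**
- On success: Displays the container ID and confirms deployment
- On failure: Prints the return code, any error output, and a list of troubleshooting tips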

The first run will download the model (5-10 minutes), while subsequent runs use the cached model for faster startup (30-60 seconds).

In [14]:
# Start NIM container
docker_cmd = f"""
docker run -d --name={CONTAINER_NAME} \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY={ngc_api_key} \
    -v {LOCAL_NIM_CACHE}:/opt/nim/.cache \
    -u $(id -u) \
    -p 8000:8000 \
    {IMG_NAME}
"""

print("Starting NIM container...")
result = subprocess.run(docker_cmd, shell=True, capture_output=True, text=True)

# Check if the command succeeded
if result.returncode == 0 and result.stdout.strip():
    container_id = result.stdout.strip()
    print(f"✅ Container started successfully!")
    print(f"Container ID: {container_id}")
else:
    print("❌ Failed to start container!")
    print(f"Return code: {result.returncode}")
    if result.stderr:
        print(f"Error message: {result.stderr}")
    if result.stdout:
        print(f"Output: {result.stdout}")
    
    # Common issues and solutions
    print("\nTroubleshooting tips:")
    print("1. Check if Docker is running: docker info")
    print("2. Check if image exists: docker images | grep llama")
    print("3. Check if port 8000 is already in use: docker ps -a")
    print(f"4. Check Docker logs: docker logs {CONTAINER_NAME}")
    print("5. Verify NGC authentication: echo $NGC_API_KEY")
    print("6. Check available disk space: df -h")
    print("7. Verify GPU is accessible: nvidia-smi")

Starting NIM container...


✅ Container started successfully!
Container ID: 56287a20a0092ce5c9ab7ef10cf1e8ca07425c3d84939b742688346787de3ce1
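The same `docker run` invocation can be built as an argument list instead of a shell string, which avoids quoting problems and keeps the API key out of the command text. This is a sketch under the assumption that `NGC_API_KEY` is already set in the notebook's environment (the key-loading cell does this); `build_nim_run_args` is an illustrative name.

```python
import os
import shlex

def build_nim_run_args(container_name, image, cache_dir, port=8000, shm_size="16GB"):
    """Build the docker run argument list used to start a NIM container."""
    return [
        "docker", "run", "-d", f"--name={container_name}",
        "--runtime=nvidia",
        "--gpus", "all",
        f"--shm-size={shm_size}",
        # No value given: Docker copies NGC_API_KEY from the caller's environment
        "-e", "NGC_API_KEY",
        "-v", f"{cache_dir}:/opt/nim/.cache",
        "-u", str(os.getuid()),
        "-p", f"{port}:8000",
        image,
    ]

args = build_nim_run_args(
    "llama3.1-8b-instruct",
    "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",
    os.path.expanduser("~/.cache/nim"),
)
# Print the command for inspection; it contains no secret
print(shlex.join(args))
```

To launch the container, pass the list directly: `subprocess.run(args, capture_output=True, text=True)`.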


### Waiting for container to set up

This function polls the health endpoint until the NIM is ready, which may take a few minutes.

Once `✅ NIM is ready!` appears, the NIM can serve requests through the familiar OpenAI API format.

If the cell times out, run it again.

In [15]:
def wait_for_nim_ready(max_attempts=60, sleep_time=5):
    """Wait for NIM to be ready to serve requests"""
    print("Waiting for NIM to start (this may take a few minutes on first run)...")

    # Prefer the container's IP; fall back to localhost if it cannot be determined
    health_url = "http://localhost:8000/v1/health/ready"
    try:
        result = subprocess.run(['docker', 'inspect', CONTAINER_NAME],
                                capture_output=True, text=True)
        container_info = json.loads(result.stdout)
        container_ip = container_info[0]['NetworkSettings']['IPAddress']
        if container_ip:
            health_url = f"http://{container_ip}:8000/v1/health/ready"
    except (OSError, ValueError, KeyError, IndexError):
        pass  # keep the localhost fallback

    for attempt in range(max_attempts):
        try:
            # A timeout keeps one hung request from stalling the whole loop
            response = requests.get(health_url, timeout=5)
            if response.status_code == 200:
                print("\n✅ NIM is ready!")
                return True
        except requests.RequestException:
            pass  # the server is not accepting connections yet

        print(".", end="", flush=True)
        time.sleep(sleep_time)

    print("\n❌ NIM failed to start")
    return False

# Wait for container to be ready
if wait_for_nim_ready():
    print("NIM is ready to serve requests!")
else:
    print("Check logs with: docker logs", CONTAINER_NAME)

Waiting for NIM to start (this may take a few minutes on first run)...
...........................................
✅ NIM is ready!
NIM is ready to serve requests!
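A polling loop can also tell the difference between "still loading" and "the container has exited", so a crashed container does not cost the full timeout. The sketch below is one way to structure that, using an overall deadline and two injected checks; `http_ready`, `container_running`, and `wait_until_ready` are illustrative names, and it uses `urllib` only to stay self-contained.

```python
import http.client
import subprocess
import time
import urllib.request

def http_ready(url, timeout=5):
    """True if the URL answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses
        return False

def container_running(name):
    """True if `docker inspect` reports the container as running."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"

def wait_until_ready(probe, is_alive, timeout_s=600, interval_s=5):
    """Poll until probe() succeeds; return 'ready', 'exited', or 'timeout'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return "ready"
        if not is_alive():
            return "exited"  # no point waiting for a dead container
        time.sleep(interval_s)
    return "timeout"

# Simulated run: the fake probe succeeds on its third call
fake_results = iter([False, False, True])
print(wait_until_ready(lambda: next(fake_results), lambda: True, timeout_s=10, interval_s=0))
```

Against a real deployment, the call would look like `wait_until_ready(lambda: http_ready(health_url), lambda: container_running(CONTAINER_NAME))`, where `health_url` is the `/v1/health/ready` address used above.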


### Get container's IP

Get the container's IP address:
- `localhost` might not work on GPU cloud instances
- We connect directly to the container's network via its IP address
- The cell then verifies the NIM is working by requesting the list of available models, confirming the API is ready to serve requests

In [16]:
import subprocess
import json

# Get container IP address
def get_container_ip(container_name):
    try:
        result = subprocess.run(['docker', 'inspect', container_name], 
                              capture_output=True, text=True)
        if result.returncode == 0:
            container_info = json.loads(result.stdout)
            ip = container_info[0]['NetworkSettings']['IPAddress']
            print(f"Container IP: {ip}")
            return ip
        else:
            print(f"Failed to get container info for '{container_name}'")
            print(f"Error: {result.stderr}")
            return None
    except Exception as e:
        print(f"Error getting container IP: {e}")
        return None

container_ip = get_container_ip(CONTAINER_NAME)

# If we have the IP, try connecting to it directly
if container_ip:
    try:
        response = requests.get(f"http://{container_ip}:8000/v1/models", timeout=5)
        if response.status_code == 200:
            print("✅ NIM is accessible via container IP!")
            print("Available models:", response.json())
        else:
            print(f"❌ Got status code {response.status_code} from container IP")
    except Exception as e:
        print(f"❌ Error connecting to container IP: {e}")

Container IP: 172.17.0.3
✅ NIM is accessible via container IP!
Available models: {'object': 'list', 'data': [{'id': 'meta/llama-3.1-8b-instruct', 'object': 'model', 'created': 1752690836, 'owned_by': 'system', 'root': 'meta/llama-3.1-8b-instruct', 'parent': None, 'max_model_len': 131072, 'permission': [{'id': 'modelperm-162147f1ab4e4f9e81686c4e70c4e153', 'object': 'model_permission', 'created': 1752690836, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}
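The lookup above reads the top-level `NetworkSettings.IPAddress` field, which is populated on Docker's default bridge network. When a container is attached to a user-defined network, that field is empty and the address is listed under `NetworkSettings.Networks` instead. The sketch below handles both cases; `extract_container_ip` is an illustrative helper and the sample JSON is made up for the demonstration.

```python
import json

def extract_container_ip(inspect_output):
    """Return the first IP address found in `docker inspect` JSON output, or None."""
    info = json.loads(inspect_output)
    if not info:
        return None
    settings = info[0].get("NetworkSettings", {})
    # Default bridge network: the address is in the top-level field
    if settings.get("IPAddress"):
        return settings["IPAddress"]
    # User-defined networks: look through the per-network entries
    for network in (settings.get("Networks") or {}).values():
        if network.get("IPAddress"):
            return network["IPAddress"]
    return None

# Made-up inspect output for a container on a user-defined network
sample = json.dumps([{
    "NetworkSettings": {
        "IPAddress": "",
        "Networks": {"nim-net": {"IPAddress": "172.18.0.2"}},
    }
}])
print(extract_container_ip(sample))
```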


## 4. Test Local NIM

Run inference against the local NIM:
- The client points to the local container instead of the cloud

Try switching between cloud and local NIMs by simply changing `base_url`.

This demonstrates the power of NIMs: same API, same code, but now running on your own hardware.

In [26]:
from openai import OpenAI
import subprocess
import json

# Create OpenAI client pointing to your local NIM
client = OpenAI(
    base_url=f"http://{container_ip}:8000/v1",
    api_key="not-needed-for-local",  # Local NIM doesn't require auth
)

# For reference, this is how we called the same model via the cloud API
# client = OpenAI(
#     base_url="https://integrate.api.nvidia.com/v1",
#     api_key=nvidia_api_key
# )

# Example: Streaming response
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Write a short poem about AI"}
    ],
    stream=True
)

print("Streaming response:")
for chunk in stream:
    # Some servers send chunks with an empty choices list, so guard before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Streaming response:
Here is a short poem about AI:

Metal minds with circuits bright,
Think and learn through endless night.
Synthetic intelligence, a wondrous thing,
Processing data, with computations that sing.

With logic guiding, and codes unseen,
AI navigates the digital scene.
From silicon dreams to futuristic depth,
A new era unfolds, a world to keep.

Yet as it rises, with power and might,
We question ethics, and what's in sight.
Will it aid humanity, or pave its own way?
Only time will tell, in a technological day.

Note on output: Your poem will likely differ from the example shown, because the model samples its output and the result depends on sampling parameters such as temperature.
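Switching between the cloud and local endpoints can be reduced to a single argument. The sketch below returns the keyword arguments for `OpenAI(...)` for either target; `client_kwargs` is an illustrative helper, while the two base URLs and the placeholder local key are the ones used in this notebook.

```python
def client_kwargs(target, container_ip="localhost", api_key=None):
    """Return the OpenAI client settings for a 'local' or 'cloud' NIM."""
    if target == "local":
        return {
            "base_url": f"http://{container_ip}:8000/v1",
            "api_key": "not-needed-for-local",  # local NIM does not check the key
        }
    if target == "cloud":
        if not api_key:
            raise ValueError("The cloud endpoint requires an NVIDIA API key")
        return {
            "base_url": "https://integrate.api.nvidia.com/v1",
            "api_key": api_key,
        }
    raise ValueError(f"Unknown target: {target!r}")

print(client_kwargs("local", container_ip="172.17.0.3")["base_url"])
```

In the notebook this would be used as `client = OpenAI(**client_kwargs("local", container_ip))`, or `client_kwargs("cloud", api_key=nvidia_api_key)` for the hosted endpoint; the rest of the inference code stays the same.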

## Models available in local NIM deployment

We call the `/v1/models` endpoint to list available models.

This confirms the model is loaded and ready, and helps verify the correct model name for inference.

In [27]:
# Check available models
response = requests.get(f"http://{container_ip}:8000/v1/models")
models = response.json()
print("Available models:")
print(json.dumps(models, indent=2))

Available models:
{
  "object": "list",
  "data": [
    {
      "id": "meta/llama-3.1-8b-instruct",
      "object": "model",
      "created": 1752691199,
      "owned_by": "system",
      "root": "meta/llama-3.1-8b-instruct",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-ab62b432d07f4ac7bebaa2d18db627c8",
          "object": "model_permission",
          "created": 1752691199,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
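To use this response for verifying a model name in code, the IDs can be pulled out of the `data` list and checked before sending an inference request. This is a small sketch; `model_ids` and `resolve_model` are illustrative helpers, and the sample dictionary is a trimmed copy of the response shown above.

```python
def model_ids(models_response):
    """Return the list of model IDs from a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]

def resolve_model(models_response, wanted):
    """Return wanted if the endpoint serves it, otherwise raise LookupError."""
    available = model_ids(models_response)
    if wanted not in available:
        raise LookupError(f"{wanted!r} is not served; available: {available}")
    return wanted

# Trimmed copy of the response above
sample_response = {
    "object": "list",
    "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}],
}
print(model_ids(sample_response))
```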


## 5. Clean Up

Before we move on to LoRA fine-tuning, let's properly clean up our deployment. This is important for a few reasons:

1. Frees up GPU memory for our next activities
2. Prevents port conflicts if we redeploy

Don't worry - the model remains cached, so if you want to restart this NIM later, it'll start up in seconds, not minutes.

In [28]:
# Stop and remove container
!docker stop {CONTAINER_NAME}
!docker rm {CONTAINER_NAME}

llama3.1-8b-instruct
llama3.1-8b-instruct
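After the container is removed, you can confirm that the published port is free again before redeploying. The sketch below tries a TCP connection to the port; `port_in_use` is an illustrative helper, and it assumes the NIM was published on port 8000 of the local host as in the deployment cell.

```python
import socket

def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0

if port_in_use(8000):
    print("Port 8000 is still in use; check `docker ps` for a running container")
else:
    print("Port 8000 is free")
```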


## Summary

You've learned how to:
- Deploy NIMs locally with Docker
- Test deployments

Next: Let's explore LoRA fine-tuning with NeMo!