# Part 4: Deploying LoRA Adapters with NVIDIA NIM

This notebook demonstrates how to deploy your trained LoRA adapters using NVIDIA NIM. We'll cover:
- Understanding NIM's LoRA deployment architecture
- Using Docker volumes for reliable LoRA mounting
- Testing your deployed LoRA adapter

## Prerequisites

Before starting, ensure you have:
1. Completed notebook 03 (LoRA training) - you should have a `.nemo` file
2. Docker installed with GPU support
3. Your NGC API key ready
4. At least 20GB of free disk space

## Understanding NIM LoRA Deployment

### How NIM Handles LoRA Adapters

NVIDIA NIM supports dynamic LoRA loading through:
- **NIM_PEFT_SOURCE**: Environment variable pointing to your LoRA directory
- **Automatic Discovery**: NIM scans for `.nemo` files in subdirectories
- **Hot Reloading**: When `NIM_PEFT_REFRESH_INTERVAL` is set, NIM rescans the directory for new adapters at that interval (in seconds)

### Expected Directory Structure

```
NIM_PEFT_SOURCE/
├── adapter1/
│   └── adapter1.nemo
├── adapter2/
│   └── adapter2.nemo
└── adapter3/
    └── adapter3.nemo
```

Each adapter must be in its own subdirectory!
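
Before deploying, it can help to check a local directory against this layout. The sketch below is a minimal, hypothetical helper (the names `find_adapters` and `layout_problems` are not part of NIM); it only inspects the filesystem and assumes one `.nemo` file per adapter subdirectory, as in the tree above.

```python
from pathlib import Path


def find_adapters(peft_source):
    """Map each subdirectory name to the .nemo files found directly inside it."""
    root = Path(peft_source)
    adapters = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        adapters[sub.name] = sorted(f.name for f in sub.glob("*.nemo"))
    return adapters


def layout_problems(peft_source):
    """Return human-readable problems; an empty list means the layout looks right."""
    root = Path(peft_source)
    # A .nemo file at the top level is not inside an adapter subdirectory.
    problems = [
        f"{f.name} sits at the top level; move it into its own subdirectory"
        for f in sorted(root.glob("*.nemo"))
    ]
    # Every subdirectory is expected to hold at least one .nemo file.
    for name, files in find_adapters(root).items():
        if not files:
            problems.append(f"{name}/ contains no .nemo file")
    return problems
```

Calling `layout_problems("loras")` after Step 1 should return an empty list if the adapter was staged as described.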

### Load NGC Key

In [1]:
# Setup and imports
import os
import subprocess
import time
import requests
import json
from pathlib import Path

# Load environment variables from .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
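    # If python-dotenv is not installed, read .env manually
    if os.path.exists('.env'):
        with open('.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, comments, and lines without an assignment
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                # Drop surrounding whitespace and optional quotes around the value
                os.environ[key.strip()] = value.strip().strip('"\'')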

# Set up environment
NGC_API_KEY = os.getenv('NGC_API_KEY')
if not NGC_API_KEY:
    print("⚠️  NGC_API_KEY not found in environment or .env file!")
    print("Please run the Workshop Setup notebook (00_Workshop_Setup.ipynb) first.")
else:
    os.environ['NGC_API_KEY'] = NGC_API_KEY

print(f"NGC API Key configured: {'✓' if NGC_API_KEY else '✗'}")
print(f"Working directory: {os.getcwd()}")

NGC API Key configured: ✓
Working directory: /root/verb-workspace/NIM-build-tune-deploy-participant


### Log into NGC

In [2]:
# Docker login to NGC (the key is passed on stdin so it never appears in a shell command)
result = subprocess.run(
    ["docker", "login", "nvcr.io", "--username", "$oauthtoken", "--password-stdin"],
    input=NGC_API_KEY, capture_output=True, text=True
)

if "Login Succeeded" in result.stdout:
    print("✓ Successfully logged in to NGC")
else:
    print("✗ Login failed!")
    print("Error:", result.stderr)

✓ Successfully logged in to NGC


## Step 1: Prepare Your LoRA Adapter

First, let's check that your LoRA adapter is ready for deployment. We're looking for the `.nemo` file from notebook 03.

In [4]:
# Check for LoRA files
lora_paths = [
    "lora_tutorial/experiments/customer_support_lora/checkpoints/customer_support_lora.nemo",
    "loras/customer_support_lora/customer_support_lora.nemo"
]

lora_file = None
for path in lora_paths:
    if os.path.exists(path):
        lora_file = path
        print(f"✓ Found LoRA adapter: {path}")
        print(f"  Size: {os.path.getsize(path) / 1024 / 1024:.2f} MB")
        break

if not lora_file:
    print("✗ No LoRA adapter found!")
    print("\nPlease ensure you've completed notebook 03 and have a .nemo file.")
    print("Expected locations:")
    for path in lora_paths:
        print(f"  - {path}")

✓ Found LoRA adapter: lora_tutorial/experiments/customer_support_lora/checkpoints/customer_support_lora.nemo
  Size: 20.04 MB


This cell prepares the LoRA adapter for deployment by creating the required directory structure (`loras/customer_support_lora`) and copying the trained LoRA file into it, which NIM expects for loading custom adapters.

In [5]:
# Create proper directory structure for NIM
!mkdir -p loras/customer_support_lora

# Copy LoRA file if needed
if lora_file and not os.path.exists("loras/customer_support_lora/customer_support_lora.nemo"):
    !cp {lora_file} loras/customer_support_lora/
    print("✓ Copied LoRA adapter to deployment directory")

# Verify structure
!echo "LoRA deployment structure:"
!tree loras/ 2>/dev/null || find loras/ -type f -name "*.nemo" | head -10

✓ Copied LoRA adapter to deployment directory
LoRA deployment structure:
loras/customer_support_lora/customer_support_lora.nemo
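
The same staging step can be written in plain Python, which avoids shell differences between environments. This is a sketch only: `stage_adapter` is a hypothetical helper, and it assumes the adapter's subdirectory should be named after the `.nemo` file, as in this notebook.

```python
import shutil
from pathlib import Path


def stage_adapter(nemo_file, deploy_root="loras"):
    """Copy a .nemo file to <deploy_root>/<adapter_name>/<adapter_name>.nemo."""
    src = Path(nemo_file)
    if src.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got {src.name}")
    # The subdirectory is named after the file, e.g. customer_support_lora/
    target_dir = Path(deploy_root) / src.stem
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    # Leave an existing copy alone, mirroring the check in the cell above.
    if not target.exists():
        shutil.copy2(src, target)
    return target
```

For example, `stage_adapter(lora_file)` would return the path `loras/customer_support_lora/customer_support_lora.nemo`.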


## Step 2: Clean Up Existing Resources

Before deploying, let's ensure we have a clean slate.

In [6]:
# Configuration
CONTAINER_NAME = "llama3.1-lora-nim-volume"
VOLUME_NAME = "nim-lora-adapters"
IMAGE_NAME = "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"

# Clean up any existing resources
print("🧹 Cleaning up existing resources...")
!docker rm -f {CONTAINER_NAME} 2>/dev/null || true
!docker volume rm {VOLUME_NAME} 2>/dev/null || true

print("\n✓ Cleanup complete")

🧹 Cleaning up existing resources...

✓ Cleanup complete


## Step 3: Create Docker Volume and Copy LoRA Files

This cell creates a Docker volume and copies the LoRA adapter file into it in two passes:

1. **Bind mount**: mounts `loras/` into a temporary container and copies the files across
2. **`docker cp`**: copies the file through a temporary container; this pass always runs, so the file reaches the volume even when the bind mount silently exposes an empty directory (common on cloud)

The volume is needed because cloud GPU instances often have issues with direct file mounting, so Docker volumes provide a reliable way to make the LoRA file available to the NIM container.

<details>

#### Important: Cloud GPU Deployment Challenges

#### The Bind Mount Problem

On cloud GPU instances, Docker bind mounts often fail due to:
- Storage driver incompatibilities
- Security policies
- Network file systems

**Symptoms:**
- Mounted directories appear empty inside containers
- Files exist on host but not visible in container
- No error messages, just empty directories

#### The Solution: Docker Volumes

Docker named volumes work reliably where bind mounts fail. We'll use this approach throughout the notebook.

</details>

In [7]:
# Create Docker volume
print("📦 Creating Docker volume for LoRA adapters...")
!docker volume create {VOLUME_NAME}

# Copy LoRA files to the volume using a temporary container
print("\n📋 Copying LoRA files to Docker volume...")
# Method 1: Try with bind mount first
copy_result = subprocess.run(
    f'docker run --rm -v {VOLUME_NAME}:/data -v $(pwd)/loras:/source alpine sh -c '
    f'"mkdir -p /data/customer_support_lora && '
    f'cp -r /source/customer_support_lora/* /data/customer_support_lora/ 2>/dev/null || echo \'No files to copy\' && '
    f'ls -la /data/customer_support_lora/"',
    shell=True, capture_output=True, text=True
)
print(copy_result.stdout)

# Method 2: Use docker cp as fallback (more reliable on cloud)
print("\n📋 Ensuring LoRA files are in the volume (using docker cp)...")
!docker run -d --name temp-container -v {VOLUME_NAME}:/data alpine sleep 3600
!docker cp loras/customer_support_lora/customer_support_lora.nemo temp-container:/data/customer_support_lora/
!docker exec temp-container ls -la /data/customer_support_lora/
!docker rm -f temp-container

print("\n✓ LoRA files copied to volume")

📦 Creating Docker volume for LoRA adapters...
nim-lora-adapters

📋 Copying LoRA files to Docker volume...
No files to copy
total 8
drwxr-xr-x    2 root     root          4096 Jul 16 19:23 .
drwxr-xr-x    3 root     root          4096 Jul 16 19:23 ..


📋 Ensuring LoRA files are in the volume (using docker cp)...
a7f937c67a85b52aa5465cb0266e1a93b666a6c4819aaa16dc318bbb623e7294
Successfully copied 21MB to temp-container:/data/customer_support_lora/
total 20528
drwxr-xr-x    2 root     root          4096 Jul 16 19:23 .
drwxr-xr-x    3 root     root          4096 Jul 16 19:23 ..
-rw-r--r--    1 root     root      21012480 Jul 16 19:20 customer_support_lora.nemo
temp-container

✓ LoRA files copied to volume


This output shows the two-step copying process:

**First attempt failed**: "No files to copy" - the bind mount method didn't work (common on cloud GPUs)

**Second attempt succeeded**: 
- "Successfully copied 21MB" - the `docker cp` method worked
- The LoRA file (`customer_support_lora.nemo`, 21MB) is now in the Docker volume
- Ready for NIM to use

This is why the cell uses two methods - the first one often fails on cloud, but the second one (docker cp) is more reliable.
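
The cell above always runs both passes. If you would rather run `docker cp` only when the bind-mount pass copied nothing, one option is to inspect the `ls -la` listing that the first pass prints. The helper below is a hypothetical sketch; it assumes the filename is the last whitespace-separated field of each listing line, which holds for the outputs shown above.

```python
def listing_has_adapter(ls_output, suffix=".nemo"):
    """Return True if any entry in an `ls -la` listing ends with the given suffix."""
    for line in ls_output.splitlines():
        fields = line.split()
        # The filename is the last field of each listing line.
        if fields and fields[-1].endswith(suffix):
            return True
    return False


# Listings taken from the two passes shown above.
first_pass = """No files to copy
total 8
drwxr-xr-x    2 root     root          4096 Jul 16 19:23 .
drwxr-xr-x    3 root     root          4096 Jul 16 19:23 ..
"""
second_pass = """total 20528
drwxr-xr-x    2 root     root          4096 Jul 16 19:23 .
drwxr-xr-x    3 root     root          4096 Jul 16 19:23 ..
-rw-r--r--    1 root     root      21012480 Jul 16 19:20 customer_support_lora.nemo
"""
print(listing_has_adapter(first_pass), listing_has_adapter(second_pass))
```

In the notebook, `if not listing_has_adapter(copy_result.stdout):` could guard the `docker cp` commands.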

## Step 4: Start NIM Container with LoRA Support

This cell starts the NIM container with LoRA support enabled.

**Key configuration:**
- `NIM_PEFT_SOURCE=/lora-store` - tells NIM where to find LoRA adapters
- `NIM_PEFT_REFRESH_INTERVAL=300` - checks for new LoRAs every 5 minutes
- `-v {VOLUME_NAME}:/lora-store` - mounts the Docker volume containing your LoRA file

The container runs in the background with GPU access and will take 2-3 minutes to initialize on first run as it loads the base model and discovers the LoRA adapter.

In [8]:
# Start NIM container with LoRA support
docker_cmd = f"""
docker run -d \\
    --name={CONTAINER_NAME} \\
    --runtime=nvidia \\
    --gpus all \\
    --shm-size=16GB \\
    -e NGC_API_KEY={NGC_API_KEY} \\
    -e NIM_PEFT_SOURCE=/lora-store \\
    -e NIM_PEFT_REFRESH_INTERVAL=300 \\
    -v {VOLUME_NAME}:/lora-store \\
    -p 8000:8000 \\
    {IMAGE_NAME}
"""

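print("🚀 Starting NIM container with LoRA support...")
# Mask the API key before printing so it does not end up in saved notebook output
printable_cmd = docker_cmd.replace(NGC_API_KEY, "<redacted>") if NGC_API_KEY else docker_cmd
print(f"\nCommand: {printable_cmd}")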

result = subprocess.run(docker_cmd, shell=True, capture_output=True, text=True)
if result.returncode == 0:
    container_id = result.stdout.strip()
    print(f"\n✓ Container started: {container_id[:12]}")
    print("\n⏳ Container is initializing. This may take 2-3 minutes on first run...")
else:
    print("\n✗ Failed to start container")
    print("Error:", result.stderr)

🚀 Starting NIM container with LoRA support...

Command: 
docker run -d \
    --name=llama3.1-lora-nim-volume \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=<redacted> \
    -e NIM_PEFT_SOURCE=/lora-store \
    -e NIM_PEFT_REFRESH_INTERVAL=300 \
    -v nim-lora-adapters:/lora-store \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:latest


✓ Container started: 3af54eec9038

⏳ Container is initializing. This may take 2-3 minutes on first run...
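
An alternative way to assemble this command is as an argument list rather than a shell string. The sketch below uses a hypothetical helper, `build_nim_command`, and relies on Docker's behavior of copying a variable from the calling environment when `-e NAME` is given without a value, so the key itself never appears in the command. It assumes `NGC_API_KEY` is already set in `os.environ`, as the first cell does.

```python
import shlex


def build_nim_command(container_name, volume_name, image_name,
                      peft_source="/lora-store", refresh_interval=300, port=8000):
    """Return the `docker run` argument list used to start NIM with LoRA support."""
    return [
        "docker", "run", "-d",
        f"--name={container_name}",
        "--runtime=nvidia",
        "--gpus", "all",
        "--shm-size=16GB",
        # No "=value": Docker takes NGC_API_KEY from the calling environment.
        "-e", "NGC_API_KEY",
        "-e", f"NIM_PEFT_SOURCE={peft_source}",
        "-e", f"NIM_PEFT_REFRESH_INTERVAL={refresh_interval}",
        "-v", f"{volume_name}:{peft_source}",
        "-p", f"{port}:{port}",
        image_name,
    ]


cmd = build_nim_command(
    "llama3.1-lora-nim-volume",
    "nim-lora-adapters",
    "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",
)
# Safe to log: only the variable name is present, not its value.
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)` without `shell=True`.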


## Step 5: Get Container Information

Let's find our container's IP address. On cloud instances, `localhost` might not work, so we query Docker for the container's actual IP.

You'll see something like 172.17.0.3, an address on Docker's internal network.

In [9]:
# Get container IP address
def get_container_ip():
    try:
        result = subprocess.run(
            f"docker inspect -f '{{{{range.NetworkSettings.Networks}}}}{{{{.IPAddress}}}}{{{{end}}}}' {CONTAINER_NAME}",
            shell=True, capture_output=True, text=True
        )
        ip = result.stdout.strip()
        return ip if ip else "localhost"
    except Exception:
        return "localhost"

container_ip = get_container_ip()
print(f"📍 Container IP: {container_ip}")
base_url = f"http://{container_ip}:8000"

# Verify LoRA files are visible inside container
print("\n🔍 Verifying LoRA files in container...")
!docker exec {CONTAINER_NAME} ls -la /lora-store/customer_support_lora/ || echo "Container still starting..."

📍 Container IP: 172.17.0.3

🔍 Verifying LoRA files in container...
total 20528
drwxr-xr-x 2 root root     4096 Jul 16 19:23 .
drwxr-xr-x 3 root root     4096 Jul 16 19:23 ..
-rw-r--r-- 1 root root 21012480 Jul 16 19:20 customer_support_lora.nemo
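
The Go template in the cell above concatenates the addresses of all attached networks into one string. If a container is attached to more than one network, parsing the JSON that `docker inspect <container>` prints may be easier to reason about. The helper below is a hypothetical sketch, and the sample input is a trimmed illustration of that JSON, not real output from this deployment.

```python
import json


def container_ips(inspect_json):
    """Map network name -> IP address from `docker inspect <container>` JSON output."""
    entries = json.loads(inspect_json)
    if not entries:
        return {}
    networks = entries[0].get("NetworkSettings", {}).get("Networks") or {}
    # Skip networks that report an empty address.
    return {
        name: cfg["IPAddress"]
        for name, cfg in networks.items()
        if cfg.get("IPAddress")
    }


# Trimmed, illustrative sample of the inspect output structure.
sample = json.dumps([
    {"NetworkSettings": {"Networks": {"bridge": {"IPAddress": "172.17.0.3"}}}}
])
print(container_ips(sample))
```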


## Step 6: Wait for NIM to Initialize

NIM needs time to initialize:
- Loading the base model
- Scanning for LoRA adapters
- Optimizing for your GPU

In [10]:
# Wait for NIM to be ready
def wait_for_nim(base_url, timeout=300):
    print("⏳ Waiting for NIM to initialize...")
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        try:
            response = requests.get(f"{base_url}/v1/health/ready", timeout=2)
            if response.status_code == 200:
                print("\n✅ NIM is ready!")
                return True
        except requests.RequestException:
            pass
        
        print(".", end="", flush=True)
        time.sleep(5)
    
    print("\n✗ Timeout waiting for NIM")
    return False

if wait_for_nim(base_url):
    # Check logs for LoRA loading
    print("\n📋 Checking LoRA synchronization logs...")
    !docker logs {CONTAINER_NAME} 2>&1 | grep -i "lora\|peft\|adapter\|synchroniz" | tail -20
else:
    print("\n⚠️  NIM is taking longer than expected. Checking logs...")
    !docker logs {CONTAINER_NAME} 2>&1 | tail -30

⏳ Waiting for NIM to initialize...
..................
✅ NIM is ready!

📋 Checking LoRA synchronization logs...
INFO 2025-07-16 19:25:39.137 ngc_profile.py:333] Running NIM with LoRA enabled. Only looking for compatible profiles that support LoRA.
INFO 2025-07-16 19:25:39.137 ngc_injector.py:159] Valid profile: 7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724 (tensorrt_llm-trtllm_buildable-bf16-tp1-pp1-lora) on GPUs [0]
INFO 2025-07-16 19:25:39.137 ngc_injector.py:159] Valid profile: f749ba07aade1d9e1c36ca1b4d0b67949122bd825e8aa6a52909115888a34b95 (vllm-bf16-tp1-pp1-lora) on GPUs [0]
INFO 2025-07-16 19:25:39.137 ngc_injector.py:315] Selected profile: 7b8458eb682edb0d2a48b4019b098ba0bfbc4377aadeeaa11b346c63c7adf724 (tensorrt_llm-trtllm_buildable-bf16-tp1-pp1-lora)
INFO 2025-07-16 19:25:39.138 ngc_injector.py:323] Profile metadata: feat_lora: true
INFO 2025-07-16 19:25:39.138 ngc_injector.py:323] Profile metadata: feat_lora_max_rank: 32
INFO 2025-07-16 19:26:54.917 launch.
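
The readiness loop above can be generalized into a small polling helper whose probe, clock, and sleep function are injectable, which makes it testable without a running container. This is a sketch under that assumption; `wait_until` is a hypothetical name, not a NIM or `requests` API.

```python
import time


def wait_until(probe, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have passed.

    Returns (ready, attempts). Exceptions raised by probe() count as "not ready".
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if probe():
                return True, attempts
        except Exception:
            pass  # e.g. connection refused while the server is still starting
        if clock() >= deadline:
            return False, attempts
        sleep(interval)


# Example with a fake probe that becomes ready on the third call.
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_probe, timeout=60, interval=0, sleep=lambda s: None))
```

In the notebook, the probe could be `lambda: requests.get(f"{base_url}/v1/health/ready", timeout=2).status_code == 200`.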

## Step 7: Verify Available Models

The moment of truth! Let's see what models are available.

You should see TWO models:
1. `meta/llama-3.1-8b-instruct` - the base model
2. `customer_support_lora` - your fine-tuned adapter

If you only see the base model, give it another minute and run the cell again. NIM might still be scanning.

In [11]:
# Check available models
try:
    response = requests.get(f"{base_url}/v1/models")
    if response.status_code == 200:
        models = response.json()
        print("📋 Available models:")
        print("=" * 50)
        
        model_count = len(models.get('data', []))
        print(f"\nFound {model_count} model(s):\n")
        
        for model in models.get('data', []):
            model_id = model.get('id', 'unknown')
            print(f"  • {model_id}")
            if model_id == "meta/llama-3.1-8b-instruct":
                print("    Type: Base model")
            elif "lora" in model_id.lower():
                print("    Type: LoRA adapter")
                print("    ✨ Your custom model is ready!")
        
        if model_count == 1:
            print("\n⚠️  Only base model found. LoRA may still be loading...")
            print("Wait 30 seconds and run this cell again.")
        elif model_count > 1:
            print("\n✅ Both base model and LoRA adapter are available!")
    else:
        print(f"Error: Status code {response.status_code}")
except Exception as e:
    print(f"Error connecting to NIM: {e}")
    print("\nMake sure the container is running and healthy.")

📋 Available models:

Found 2 model(s):

  • meta/llama-3.1-8b-instruct
  • customer_support_lora
    Type: LoRA adapter
    ✨ Your custom model is ready!

✅ Both base model and LoRA adapter are available!
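
Counting models works for a single adapter, but with several adapters it may be clearer to separate the base model from everything else explicitly. The sketch below uses a hypothetical helper, `split_models`, and assumes that every id other than the base model's is a LoRA adapter. The sample payload mirrors the two ids listed above and includes only the `id` field.

```python
def split_models(models_response, base_model_id):
    """Split a /v1/models payload into (base_model_present, adapter_ids)."""
    ids = [m.get("id", "") for m in models_response.get("data", [])]
    base_present = base_model_id in ids
    # Assumption: anything that is not the base model is a LoRA adapter.
    adapters = [i for i in ids if i and i != base_model_id]
    return base_present, adapters


sample_response = {
    "data": [
        {"id": "meta/llama-3.1-8b-instruct"},
        {"id": "customer_support_lora"},
    ]
}
print(split_models(sample_response, "meta/llama-3.1-8b-instruct"))
```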


## Step 8: Test Your LoRA Adapter

Now for the exciting part - let's test both models with the same query!

Watch the difference:
- Base model: Generic, helpful but not specific
- LoRA model: Uses your training data, knows your policies

This is the power of fine-tuning - domain-specific responses without training a whole new model!

In [12]:
# Test function
def test_model(model_name, query):
    """Test a model with a query"""
    url = f"{base_url}/v1/chat/completions"
    
    data = {
        "model": model_name,
        "messages": [
            # {
            #     "role": "system",
            #     "content": "You are a helpful customer support assistant."
            # },
            {
                "role": "user",
                "content": query
            }
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }
    
    try:
        response = requests.post(url, json=data, timeout=30)
        if response.status_code == 200:
            return response.json()['choices'][0]['message']['content']
        else:
            return f"Error: {response.status_code} - {response.text}"
    except Exception as e:
        return f"Error: {e}"

# Test query
test_query = "My order hasn't arrived yet. Order number is 12345."

print("🧪 Test Query:", test_query)
print("=" * 70)

# Test base model
print("\n🤖 BASE MODEL RESPONSE:")
print("-" * 70)
base_response = test_model("meta/llama-3.1-8b-instruct", test_query)
print(base_response)

# Test LoRA model
print("\n🎯 LORA MODEL RESPONSE:")
print("-" * 70)
lora_response = test_model("customer_support_lora", test_query)
print(lora_response)

print("\n" + "=" * 70)
print("💡 Notice the difference? The LoRA model provides more specific,")
print("   policy-aware responses based on your training data!")

🧪 Test Query: My order hasn't arrived yet. Order number is 12345.

🤖 BASE MODEL RESPONSE:
----------------------------------------------------------------------
I'd be happy to help you track down your order. To assist you further, I'll need to know a few more details. Here are a few questions:

1. Who did you place the order with? Was it an online retailer, a physical store, or a marketplace like Amazon or eBay?
2. When did you place the order? Was it yesterday, last week, or a few days ago?
3. Have you checked the estimated delivery date and the tracking status on the order confirmation email or the retailer's website?

Please let me know the answers to these questions, and I'll do my best to help you locate your order and find out what's going on!

🎯 LORA MODEL RESPONSE:
----------------------------------------------------------------------
I'd be happy to help you with your order #12345. Can you please tell me a little more about the order, such as the date you placed it, the items
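
The test function above has its system message commented out. To compare runs with and without a system prompt, the request body can be built by a small function. This is a sketch with a hypothetical helper name, `build_chat_payload`; the fields are the ones already used in the cell above.

```python
def build_chat_payload(model_name, query, system_prompt=None,
                       max_tokens=150, temperature=0.7):
    """Build the JSON body for a /v1/chat/completions request."""
    messages = []
    # The system message is optional, matching the commented-out block above.
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": query})
    return {
        "model": model_name,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_payload(
    "customer_support_lora",
    "My order hasn't arrived yet. Order number is 12345.",
    system_prompt="You are a helpful customer support assistant.",
)
print([m["role"] for m in payload["messages"]])
```

The result can be sent with `requests.post(url, json=payload, timeout=30)`, as in the cell above.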

## Step 9: Test Multiple Scenarios

Let's test a few more scenarios. Don't expect every one to show a clear improvement: the adapter was trained on very little data for only a few steps.

In [15]:
# Test multiple scenarios
test_scenarios = [
    "How do I reset my password?",
    "How can I return my item?",
    "I received a damaged product. What should I do?",
    "Does your shop offer international shipping?"
]

print("🧪 Testing Multiple Scenarios")
print("=" * 80)

for i, scenario in enumerate(test_scenarios, 1):
    print(f"\n📌 Scenario {i}: {scenario}")
    print("-" * 80)
    
    # Test both models
    base_response = test_model("meta/llama-3.1-8b-instruct", scenario)
    lora_response = test_model("customer_support_lora", scenario)
    
    print("\nBase Model:")
    print(base_response[:200] + "..." if len(base_response) > 200 else base_response)
    
    print("\nLoRA Model:")
    print(lora_response[:200] + "..." if len(lora_response) > 200 else lora_response)

print("\n" + "=" * 80)
print("✅ Testing complete! Compare the responses above to judge what the adapter changed.")

🧪 Testing Multiple Scenarios

📌 Scenario 1: How do I reset my password?
--------------------------------------------------------------------------------



Base Model:
Please check the help center for that type of information: https://www.meta.com/help/

LoRA Model:
I'd be happy to help you reset your password. To get started, could you please tell me a little more about your account? What is your username or email address associated with your account?

📌 Scenario 2: How can I return my item?
--------------------------------------------------------------------------------

Base Model:
To get the most accurate return information for your item, could you please provide me with more details, such as:

* The item you'd like to return
* Where you purchased the item
* Your location (coun...

LoRA Model:
To return an item, you'll need to follow the return policy of the store or retailer. Here's a general steps to return an item:

1. **Check the return policy**: Look for the return policy on the store'...

📌 Scenario 3: I received a damaged product. What should I do?
-------------------------------------------------------------------------------
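
Reading responses side by side is subjective, so a rough numeric signal can help spot scenarios where the adapter changed little. The sketch below uses the standard library's `difflib.SequenceMatcher` for character-level similarity; it is a crude heuristic, not an evaluation metric, and the helper names are hypothetical. Because the requests use `temperature=0.7`, two calls to the same model will also differ, so treat the number as a hint only.

```python
from difflib import SequenceMatcher


def response_similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def preview(text, limit=200):
    """Truncate long responses for display, as the loop above does."""
    return text if len(text) <= limit else text[:limit] + "..."


print(response_similarity("same answer", "same answer"))
print(preview("x" * 250)[-3:])
```

Inside the loop, `response_similarity(base_response, lora_response)` could be printed after each pair of answers.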

## Summary

You've successfully deployed a LoRA adapter with NVIDIA NIM! Key takeaways:

✅ **Use Docker volumes** for reliable deployment on cloud GPUs  
✅ **Follow the directory structure** - each LoRA in its own subdirectory  
✅ **NIM handles the complexity** - automatic discovery, optimized serving  

### Resources

- [NVIDIA NIM Documentation](https://docs.nvidia.com/nim/)
- [NeMo Framework](https://github.com/NVIDIA/NeMo)
- [NGC Catalog](https://catalog.ngc.nvidia.com)

Happy NIM-ing! 🚀