# Production Model Deployment Notebook

**Purpose:** Deploy fine-tuned model to Ollama for LIMA integration

**Process:** Merge LoRA adapters → Convert to GGUF → Import to Ollama

**Reference:** [LIMA_INTEGRATION.private.md](LIMA_INTEGRATION.private.md) | [DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md)

---

## Setup: Import Dependencies & Configure Logging

In [1]:
import os
import sys
import subprocess
from pathlib import Path
from typing import Optional
import logging

# Core ML libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from dotenv import load_dotenv

# Configure logging for production visibility
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler('deployment.log')
    ]
)
logger = logging.getLogger(__name__)



## Step 1: Load Configuration & Validate Environment

Load environment variables and validate that every required setting is present before proceeding with the deployment.

In [None]:
logger.info("Loading environment configuration...")
load_dotenv()

# Extract configuration from environment
BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-Coder-32B-Instruct")
MODEL_NAME = os.getenv("MODEL_NAME", BASE_MODEL)
OUTPUT_MODEL_NAME = os.getenv("OUTPUT_MODEL_NAME")
QUANTIZATION = os.getenv("QUANTIZATION", "Q4_K_M")  # Options: Q4_K_M, Q5_K_M, Q8_0

# Validate required configuration
if not OUTPUT_MODEL_NAME:
    raise ValueError(
        "OUTPUT_MODEL_NAME must be set in .env file.\n"
        "Example: OUTPUT_MODEL_NAME=my-custom-model"
    )

# Define paths
FINE_TUNED_PATH = Path("./fine-tuned-model") / MODEL_NAME.replace("/", "_")
MERGED_MODEL_PATH = Path("./merged-model")
OLLAMA_MODEL_PATH = Path(f"./{OUTPUT_MODEL_NAME}.gguf")

logger.info("Configuration loaded:")
logger.info(f"  Base Model: {BASE_MODEL}")
logger.info(f"  Fine-tuned Model Path: {FINE_TUNED_PATH}")
logger.info(f"  Output Model Name: {OUTPUT_MODEL_NAME}")
logger.info(f"  Quantization Level: {QUANTIZATION}")

# Validate fine-tuned model exists
if not FINE_TUNED_PATH.exists():
    logger.error(f"Fine-tuned model not found at: {FINE_TUNED_PATH}")
    raise FileNotFoundError(
        f"Fine-tuned model directory does not exist: {FINE_TUNED_PATH}\n"
        "Please run supervised_fine_tuning.ipynb first."
    )

logger.info("✓ Environment validation complete")

2026-01-12 20:52:38,354 - INFO - Loading environment configuration...
2026-01-12 20:52:38,359 - INFO - Configuration loaded:
2026-01-12 20:52:38,359 - INFO -   Base Model: Qwen/Qwen2.5-Coder-32B-Instruct
2026-01-12 20:52:38,360 - INFO -   Fine-tuned Model Path: fine-tuned-model/Qwen_Qwen2.5-Coder-32B-Instruct
2026-01-12 20:52:38,360 - INFO -   Output Model Name: lima-finetuned-model
2026-01-12 20:52:38,361 - INFO -   Quantization Level: Q4_K_M
2026-01-12 20:52:38,361 - INFO - ✓ Environment validation complete
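
The validation above checks a single required variable. If more settings become mandatory, the same idea can be generalized so that all missing names are reported at once. The following is a minimal sketch; `require_env` is a hypothetical helper, not part of this notebook or of `python-dotenv`.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each required variable.

    Raises ValueError listing every variable that is unset or empty.
    `environ` defaults to os.environ; passing a dict makes the helper testable.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ValueError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: environ[name] for name in names}


# Usage with an explicit mapping instead of the real environment
config = require_env(["OUTPUT_MODEL_NAME"], {"OUTPUT_MODEL_NAME": "my-custom-model"})
print(config["OUTPUT_MODEL_NAME"])
```

Collecting the missing names before raising avoids the fix-one-rerun-fix-another loop when several variables are absent.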


## Step 2: Load Base Model & LoRA Adapters

Load the pre-trained base model and apply the fine-tuned LoRA adapters.

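⚠️ **Note:** Loading the model in its native precision needs roughly as much memory as the weights occupy on disk (about 61 GB for the 32B model in this run). `device_map="auto"` spreads the weights across the available GPUs and falls back to CPU memory when they do not fit.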

In [3]:
try:
    logger.info(f"Loading base model: {BASE_MODEL}")
    logger.info("  (This may take several minutes depending on model size...)")
    
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        device_map="auto",  # Automatically handle device placement
        trust_remote_code=True,  # Required for some models
        dtype="auto"  # Use model's native precision (`dtype` replaces the deprecated `torch_dtype`)
    )
    logger.info("✓ Base model loaded successfully")
    
    logger.info(f"Loading LoRA adapters from: {FINE_TUNED_PATH}")
    model = PeftModel.from_pretrained(
        base_model,
        str(FINE_TUNED_PATH),
        device_map="auto"
    )
    logger.info("✓ LoRA adapters loaded successfully")
    logger.info(f"Model loaded with {model.num_parameters():,} parameters")
    
except Exception as e:
    logger.error(f"Failed to load model: {str(e)}")
    raise RuntimeError(f"Model loading failed: {str(e)}")

2026-01-12 20:52:50,286 - INFO - Loading base model: Qwen/Qwen2.5-Coder-32B-Instruct
2026-01-12 20:52:50,287 - INFO -   (This may take several minutes depending on model size...)


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 14/14 [00:29<00:00,  2.11s/it]


2026-01-12 20:53:21,200 - INFO - ✓ Base model loaded successfully
2026-01-12 20:53:21,201 - INFO - Loading LoRA adapters from: fine-tuned-model/Qwen_Qwen2.5-Coder-32B-Instruct
2026-01-12 20:53:22,332 - INFO - ✓ LoRA adapters loaded successfully
2026-01-12 20:53:22,334 - INFO - Model loaded with 32,797,430,784 parameters
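
The parameter count in the log gives a quick way to estimate how much memory the weights need. The sketch below assumes the weights are held in a 16-bit type such as bfloat16 (an assumption about what `"auto"` selects for this checkpoint) and ignores activations and framework overhead; `weight_memory_gib` is a hypothetical helper.

```python
# Bytes used per parameter for common floating-point types
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_parameters: int, dtype: str = "bfloat16") -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1024**3


# Parameter count reported in the log above
estimate = weight_memory_gib(32_797_430_784)
print(f"Approximate weight memory: {estimate:.2f} GiB")
```

The estimate lands close to the size of the merged model saved in Step 4, which is a useful sanity check that the weights were not silently upcast to 32-bit.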


## Step 3: Merge LoRA Adapters into Base Model

Merge the LoRA adapter weights into the base model to create a standalone model. This is required for GGUF conversion.

⚠️ **Warning:** This operation requires memory equal to the size of the full model.

In [4]:
try:
    logger.info("Merging LoRA adapters into base model...")
    logger.info("  (This creates a standalone model without adapter overhead)")
    
    merged_model = model.merge_and_unload()
    logger.info("✓ Model merge completed successfully")
    
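    # Drop the extra references to the PEFT wrapper and the base model.
    # merge_and_unload() folds the adapters into the base weights, which
    # merged_model still holds, so this releases only the adapter wrapper.
    del model
    del base_model
    import gc
    gc.collect()
    logger.info("  Memory cleanup performed")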
    
except Exception as e:
    logger.error(f"Model merge failed: {str(e)}")
    raise RuntimeError(f"Failed to merge model: {str(e)}")

2026-01-12 20:53:28,170 - INFO - Merging LoRA adapters into base model...
2026-01-12 20:53:28,171 - INFO -   (This creates a standalone model without adapter overhead)
2026-01-12 20:53:29,166 - INFO - ✓ Model merge completed successfully
2026-01-12 20:53:29,266 - INFO -   Memory cleanup performed
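
To make the merge concrete: for each adapted layer, LoRA adds a low-rank update to the frozen weight, and merging folds that update in as `W' = W + (alpha / r) * B @ A`. The toy example below uses plain Python lists and made-up numbers to show that the merged weight gives the same output as the base weight plus the adapter path. It is an illustration of the arithmetic only, not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are made up)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (rank, in_features)
B = [[2.0], [4.0]]  # shape (out_features, rank)
x = [[3.0], [1.0]]  # one input column vector
alpha, rank = 2, 1

merged = merge_lora(W, A, B, alpha, rank)

# Adapter path: W x + scale * B (A x). Merged path: W' x.
scale = alpha / rank
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]
merged_out = matmul(merged, x)

print(merged_out == unmerged_out)
```

Because the merged weight has the same shape as the original, the resulting model no longer needs the `peft` library at inference time, which is what the GGUF converter expects.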


## Step 4: Save Merged Model to Disk

Save the merged model and tokenizer in HuggingFace format. This creates a complete, standalone model that can be converted to GGUF format or shared/deployed independently.

In [5]:
try:
    logger.info(f"Saving merged model to: {MERGED_MODEL_PATH}")
    MERGED_MODEL_PATH.mkdir(parents=True, exist_ok=True)
    
    # Save model weights
    logger.info("  Saving model weights...")
    merged_model.save_pretrained(
        MERGED_MODEL_PATH,
        safe_serialization=True,  # Use safetensors format (recommended)
        max_shard_size="5GB"  # Shard large models for easier handling
    )
    
    # Save tokenizer
    logger.info("  Saving tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
    tokenizer.save_pretrained(MERGED_MODEL_PATH)
    
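    # Save model card with metadata
    from datetime import datetime  # portable timestamp instead of shelling out to `date`
    model_card = f"""---
base_model: {BASE_MODEL}
fine_tuned_from: {FINE_TUNED_PATH}
created: {datetime.now().astimezone().isoformat(timespec='seconds')}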
quantization: {QUANTIZATION}
purpose: LIMA integration
---

# {OUTPUT_MODEL_NAME}

This is a fine-tuned version of {BASE_MODEL} optimized for LIMA application.
"""
    (MERGED_MODEL_PATH / "README.md").write_text(model_card)
    
    logger.info(f"✓ Model saved successfully to: {MERGED_MODEL_PATH}")
    logger.info(f"  Model size: {sum(f.stat().st_size for f in MERGED_MODEL_PATH.rglob('*') if f.is_file()) / (1024**3):.2f} GB")
    
except Exception as e:
    logger.error(f"Failed to save model: {str(e)}")
    raise RuntimeError(f"Model save operation failed: {str(e)}")

2026-01-12 20:53:45,653 - INFO - Saving merged model to: merged-model
2026-01-12 20:53:45,655 - INFO -   Saving model weights...
2026-01-12 20:53:58,216 - INFO -   Saving tokenizer...
2026-01-12 20:53:59,450 - INFO - ✓ Model saved successfully to: merged-model
2026-01-12 20:53:59,452 - INFO -   Model size: 61.04 GB
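
Before moving on to conversion, it can help to confirm that the saved directory contains the files the converter will look for. The sketch below checks for `config.json`, at least one `.safetensors` shard, and `tokenizer_config.json`. This file list is an assumption based on what `save_pretrained` typically writes, and `missing_artifacts` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def missing_artifacts(model_dir: Path) -> list:
    """Return a list of expected files that are absent from model_dir."""
    problems = []
    if not (model_dir / "config.json").is_file():
        problems.append("config.json")
    if not any(model_dir.glob("*.safetensors")):
        problems.append("*.safetensors weight shards")
    if not (model_dir / "tokenizer_config.json").is_file():
        problems.append("tokenizer_config.json")
    return problems


# Demonstration on a throwaway directory that only has a config file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "config.json").write_text("{}")
    print(missing_artifacts(demo_dir))
```

In the notebook, calling `missing_artifacts(MERGED_MODEL_PATH)` and raising when the list is non-empty would surface an incomplete save before the longer conversion step starts.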


## Step 5: Convert to GGUF Format

Convert the HuggingFace model to GGUF format for Ollama. This requires llama.cpp tooling.

**Prerequisites:**
- llama.cpp is cloned and built automatically (this needs `git` and a C/C++ build toolchain on the PATH)
- The Python requirements from llama.cpp are installed automatically

In [9]:
def ensure_llama_cpp():
    """Ensure llama.cpp is available and up-to-date"""
    llama_cpp_path = Path("./llama.cpp")
    
    if not llama_cpp_path.exists():
        logger.info("Cloning llama.cpp repository...")
        result = subprocess.run(
            ["git", "clone", "https://github.com/ggerganov/llama.cpp"],
            capture_output=True,
            text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to clone llama.cpp: {result.stderr}")
        logger.info("✓ llama.cpp cloned successfully")
    else:
        logger.info("llama.cpp already exists, pulling latest changes...")
        subprocess.run(["git", "-C", str(llama_cpp_path), "pull"], capture_output=True)
    
    # Install essential packages first
    logger.info("Installing essential Python packages for conversion...")
    essential_packages = [
        "numpy",
        "sentencepiece", 
        "gguf",
        "protobuf"
    ]
    
    for package in essential_packages:
        logger.info(f"  Installing {package}...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-U", package],
            capture_output=True,
            text=True
        )
        if result.returncode != 0:
            logger.warning(f"    Failed to install {package}: {result.stderr}")
        else:
            logger.info(f"    ✓ {package} installed")
    
    # Now try to install remaining requirements from llama.cpp
    requirements_file = llama_cpp_path / "requirements.txt"
    if requirements_file.exists():
        logger.info("Installing additional llama.cpp requirements...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", str(requirements_file)],
            capture_output=True,
            text=True
        )
        if result.returncode != 0:
            logger.warning("Some additional requirements failed, but continuing...")
        else:
            logger.info("✓ Additional requirements installed")
    
    # Verify sentencepiece is available
    try:
        import sentencepiece
        logger.info(f"✓ sentencepiece verified: version {sentencepiece.__version__}")
    except ImportError:
        raise RuntimeError(
            "sentencepiece installation failed. Please install manually:\n"
            "  pip install sentencepiece protobuf"
        )
    
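    # Build quantization tools (llama.cpp replaced its Makefile build with CMake)
    logger.info("Building llama.cpp tools (this may take a few minutes)...")
    build_dir = llama_cpp_path / "build"
    for build_cmd in (
        ["cmake", "-S", str(llama_cpp_path), "-B", str(build_dir)],
        ["cmake", "--build", str(build_dir), "--config", "Release"],
    ):
        result = subprocess.run(build_cmd, capture_output=True, text=True)
        if result.returncode != 0:
            logger.warning(f"Build had issues but continuing: {result.stderr}")
            break
    else:
        logger.info("✓ llama.cpp tools built successfully")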
    
    return llama_cpp_path

def convert_to_gguf(merged_path: Path, output_path: Path, llama_cpp_path: Path):
    """Convert HuggingFace model to GGUF format"""
    logger.info(f"Converting to GGUF format: {output_path}")
    
    convert_script = llama_cpp_path / "convert_hf_to_gguf.py"
    if not convert_script.exists():
        # Try alternative script name
        convert_script = llama_cpp_path / "convert.py"
        if not convert_script.exists():
            raise FileNotFoundError(
                f"Conversion script not found in {llama_cpp_path}.\n"
                "Expected: convert_hf_to_gguf.py or convert.py"
            )
    
    cmd = [
        sys.executable,
        str(convert_script),
        str(merged_path),
        "--outfile", str(output_path),
        "--outtype", "f16"  # Use f16 precision for unquantized version
    ]
    
    logger.info(f"  Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode != 0:
        logger.error(f"Conversion stdout: {result.stdout}")
        logger.error(f"Conversion stderr: {result.stderr}")
        raise RuntimeError(
            f"GGUF conversion failed with exit code {result.returncode}.\n"
            f"Error: {result.stderr}\n"
            "Check that the merged model format is compatible with llama.cpp."
        )
    
    logger.info("✓ GGUF conversion completed")
    return output_path

try:
    llama_cpp_path = ensure_llama_cpp()
    gguf_path = convert_to_gguf(MERGED_MODEL_PATH, OLLAMA_MODEL_PATH, llama_cpp_path)
    
    logger.info(f"✓ GGUF model created: {gguf_path}")
    logger.info(f"  File size: {gguf_path.stat().st_size / (1024**3):.2f} GB")
    
except Exception as e:
    logger.error(f"GGUF conversion failed: {str(e)}")
    logger.info("\nTroubleshooting steps:")
    logger.info("1. Manually install: pip install sentencepiece protobuf gguf numpy")
    logger.info("2. Ensure merged model was saved correctly in previous step")
    logger.info("3. Check llama.cpp GitHub for latest compatibility updates")
    logger.info("4. Verify tokenizer files exist in merged-model directory")
    raise

2026-01-12 21:01:52,082 - INFO - llama.cpp already exists, pulling latest changes...
2026-01-12 21:01:53,921 - INFO - Installing essential Python packages for conversion...
2026-01-12 21:01:53,922 - INFO -   Installing numpy...

2026-01-12 21:01:53,942 - INFO -   Installing sentencepiece...

2026-01-12 21:01:53,961 - INFO -   Installing gguf...

2026-01-12 21:01:53,980 - INFO -   Installing protobuf...

2026-01-12 21:01:53,998 - INFO - Installing additional llama.cpp requirements...
2026-01-12 21:01:54,017 - INFO - ✓ sentencepiece verified: version 0.2.1
2026-01-12 21:01:54,018 - INFO - Building llama.cpp tools (this may take a few minutes)...
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.

2026-01-12 21:01:54,031 - INFO - Converting to GGUF format: lima-finetuned-model.gguf
2026-01-12 21:01:54,031 - INFO -   Running: /Users/rahulkumar/dev/.trainingEnv/bin/python llama.cpp/convert_
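
A cheap way to confirm that conversion produced a real GGUF file, rather than an empty or truncated one, is to read its header. GGUF files begin with the four magic bytes `GGUF` followed by a 32-bit format version. The sketch below assumes the common little-endian layout; `read_gguf_version` is a hypothetical helper.

```python
import struct
import tempfile
from pathlib import Path


def read_gguf_version(path: Path) -> int:
    """Return the GGUF format version, or raise ValueError for a non-GGUF file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demonstration on a synthetic 8-byte header (not a usable model file)
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "fake.gguf"
    fake.write_bytes(b"GGUF" + struct.pack("<I", 3))
    print(read_gguf_version(fake))
```

In the notebook, calling this on `OLLAMA_MODEL_PATH` right after conversion would fail fast instead of letting a bad file reach the quantization or import step.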

## Step 6: Quantize Model & Import to Ollama

Quantize the model to reduce size (recommended for production), then create a Modelfile and import to Ollama for serving.

**Quantization Options:**
- **Q4_K_M**: 4-bit, good quality/size balance (recommended)
- **Q5_K_M**: 5-bit, better quality, larger size
- **Q8_0**: 8-bit, best quality, largest size

In [10]:
def quantize_model(input_path: Path, output_path: Path, quantization: str, llama_cpp_path: Path):
    """Quantize GGUF model to reduce size"""
    logger.info(f"Quantizing model with {quantization}...")
    
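    # CMake builds place the binary in build/bin; older Makefile builds left it in the repo root
    candidates = [
        llama_cpp_path / "build" / "bin" / "llama-quantize",
        llama_cpp_path / "llama-quantize",
    ]
    quantize_tool = next((p for p in candidates if p.exists()), None)
    if quantize_tool is None:
        logger.warning("Quantization tool not found, skipping quantization")
        return input_path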
    
    cmd = [str(quantize_tool), str(input_path), str(output_path), quantization]
    logger.info(f"  Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode != 0:
        logger.error(f"Quantization failed: {result.stderr}")
        raise RuntimeError(f"Model quantization failed: {result.stderr}")
    
    logger.info("✓ Model quantization completed")
    logger.info(f"  Original size: {input_path.stat().st_size / (1024**3):.2f} GB")
    logger.info(f"  Quantized size: {output_path.stat().st_size / (1024**3):.2f} GB")
    logger.info(f"  Compression ratio: {input_path.stat().st_size / output_path.stat().st_size:.2f}x")
    
    return output_path

def create_ollama_model(model_path: Path, model_name: str):
    """Create Modelfile and import model to Ollama"""
    logger.info(f"Creating Ollama model: {model_name}")
    
    # Check if Ollama is available (shutil.which is portable, unlike the `which` binary)
    import shutil
    if shutil.which("ollama") is None:
        raise RuntimeError("Ollama not found. Please install Ollama from https://ollama.ai")
    
    # Create Modelfile with proper configuration
    modelfile_content = f'''FROM ./{model_path.name}

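# ChatML prompt template (the chat format Qwen2.5 instruct models use).
# A template containing only the prompt would silently drop the SYSTEM message below.
TEMPLATE """{{{{ if .System }}}}<|im_start|>system
{{{{ .System }}}}<|im_end|>
{{{{ end }}}}{{{{ if .Prompt }}}}<|im_start|>user
{{{{ .Prompt }}}}<|im_end|>
{{{{ end }}}}<|im_start|>assistant
{{{{ .Response }}}}<|im_end|>
"""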

# Model parameters optimized for LIMA
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"

# System message for LIMA context
SYSTEM """You are a helpful AI assistant integrated with LIMA (Life Insurance & Managed Accounts) system. Provide accurate, concise answers about insurance and financial products."""
'''
    
    modelfile_path = Path("./Modelfile")
    modelfile_path.write_text(modelfile_content)
    logger.info(f"  Modelfile created: {modelfile_path}")
    
    # Import to Ollama
    logger.info("  Importing model to Ollama...")
    result = subprocess.run(
        ["ollama", "create", model_name, "-f", str(modelfile_path)],
        capture_output=True,
        text=True
    )
    
    if result.returncode != 0:
        logger.error(f"Ollama import failed: {result.stderr}")
        raise RuntimeError(f"Failed to import model to Ollama: {result.stderr}")
    
    logger.info(f"✓ Model '{model_name}' successfully imported to Ollama")
    return model_name

try:
    # Quantize if requested
    if QUANTIZATION and QUANTIZATION != "none":
        quantized_path = OLLAMA_MODEL_PATH.with_stem(
            f"{OLLAMA_MODEL_PATH.stem}-{QUANTIZATION.lower()}"
        )
        final_model_path = quantize_model(
            OLLAMA_MODEL_PATH, 
            quantized_path, 
            QUANTIZATION, 
            llama_cpp_path
        )
    else:
        logger.info("Skipping quantization (using full precision model)")
        final_model_path = OLLAMA_MODEL_PATH
    
    # Import to Ollama
    ollama_model_name = create_ollama_model(final_model_path, OUTPUT_MODEL_NAME)
    
    # Display completion message
    logger.info(f"\n{'='*60}")
    logger.info("DEPLOYMENT COMPLETE!")
    logger.info(f"{'='*60}")
    logger.info(f"Model Name: {ollama_model_name}")
    logger.info(f"Model Path: {final_model_path}")
    logger.info(f"\nTest with: ollama run {ollama_model_name} \"What is life insurance?\"")
    logger.info(f"\nTo use in LIMA, update .env with:")
    logger.info(f"  LOCAL_MODEL_NAME={ollama_model_name}")
    logger.info(f"  LOCAL_MODEL_URL=http://localhost:11434")
    logger.info(f"  LOCAL_MODEL_TYPE=ollama")
    
except Exception as e:
    logger.error(f"Quantization or Ollama import failed: {str(e)}")
    raise

2026-01-12 21:03:39,370 - INFO - Quantizing model with Q4_K_M...
2026-01-12 21:03:39,371 - INFO - Creating Ollama model: lima-finetuned-model
2026-01-12 21:03:39,384 - INFO -   Modelfile created: Modelfile
2026-01-12 21:03:39,385 - INFO -   Importing model to Ollama...
2026-01-12 21:04:44,526 - INFO - ✓ Model 'lima-finetuned-model' successfully imported to Ollama
2026-01-12 21:04:44,526 - INFO - 
2026-01-12 21:04:44,527 - INFO - DEPLOYMENT COMPLETE!
2026-01-12 21:04:44,528 - INFO - Model Name: lima-finetuned-model
2026-01-12 21:04:44,528 - INFO - Model Path: lima-finetuned-model.gguf
2026-01-12 21:04:44,528 - INFO - 
Test with: ollama run lima-finetuned-model "What is life insurance?"
2026-01-12 21:04:44,529 - INFO - 
To use in LIMA, update .env with:
2026-01-12 21:04:44,529 - INFO -   LOCAL_MODEL_NAME=lima-finetuned-model
2026-01-12 21:04:44,529 - INFO -   LOCAL_MODEL_URL=http://localhost:11434
2026-01-12 21:04:44,529 - INFO -   LOCAL_MODEL_TYPE=ollama
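
One detail in `create_ollama_model` is easy to get wrong: the Modelfile is built with an f-string, and Ollama templates use double braces. Inside an f-string each literal brace must be doubled, so a template placeholder needs four braces on each side. The snippet below shows the effect with a hypothetical file name.

```python
model_file_name = "example.gguf"  # hypothetical file name for illustration

# Single braces interpolate Python values; quadrupled braces emit literal "{{" and "}}"
modelfile = f'''FROM ./{model_file_name}
TEMPLATE """{{{{ .Prompt }}}}"""
'''

print(modelfile)
```

Printing the generated Modelfile before running `ollama create` is a quick way to confirm that the braces survived formatting.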


## Step 7: Validate Deployment (Optional)

Test the deployed model through Ollama to ensure it's working correctly before integrating with LIMA.

In [None]:
def test_ollama_model(model_name: str, test_prompt: Optional[str] = None):
    """Test the deployed model with a sample prompt"""
    
    if test_prompt is None:
        test_prompt = os.getenv("TEST_PROMPT", "What is life insurance?")
    
    logger.info(f"\nTesting model: {model_name}")
    logger.info(f"Test prompt: {test_prompt}")
    logger.info("-" * 60)
    
    try:
        result = subprocess.run(
            ["ollama", "run", model_name, test_prompt],
            capture_output=True,
            text=True,
            timeout=60
        )
        
        if result.returncode != 0:
            logger.error(f"Model test failed: {result.stderr}")
            return False
        
        logger.info("Model Response:")
        logger.info(result.stdout)
        logger.info("-" * 60)
        logger.info("✓ Model test successful!")
        return True
        
    except subprocess.TimeoutExpired:
        logger.error("Model test timed out after 60 seconds")
        return False
    except Exception as e:
        logger.error(f"Model test failed: {str(e)}")
        return False

# Run validation test
try:
    test_success = test_ollama_model(OUTPUT_MODEL_NAME)
    
    if test_success:
        logger.info("\n" + "="*60)
        logger.info("✓ ALL DEPLOYMENT STEPS COMPLETED SUCCESSFULLY")
        logger.info("="*60)
        logger.info("\nNext steps:")
        logger.info("1. Update LIMA's .env file with model configuration")
        logger.info("2. Restart LIMA services to pick up new model")
        logger.info("3. Run LIMA integration tests")
        logger.info("4. Monitor model performance in production")
        logger.info("\nFor more details, see: LIMA_INTEGRATION.private.md")
    else:
        logger.warning("Model test failed. Please check Ollama logs for details.")
        
except Exception as e:
    logger.error(f"Validation failed: {str(e)}")
    logger.info("You can still use the model, but manual testing is recommended.")
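
LIMA is configured with `LOCAL_MODEL_URL=http://localhost:11434`, so it will reach the model over HTTP rather than through the CLI. A validation closer to that path can call Ollama's `/api/generate` endpoint directly. The sketch below uses only the standard library; `build_generate_request` and `generate` are hypothetical helpers, and the example only constructs the request so it runs without a server.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (requires a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Inspect the request without sending it
request = build_generate_request("lima-finetuned-model", "What is life insurance?")
print(request.full_url)
```

With Ollama running, `generate(OUTPUT_MODEL_NAME, "What is life insurance?")` should return the model's answer as a string. A large model may need a timeout well above 60 seconds on its first load.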