# Embedded GPU Qwen2.5-7B Intelligence Extraction - SageMaker Studio

This notebook runs the Streamlit app with **embedded GPU inference** (no Ollama needed).

**Branch:** `embedded-microbatch`

---

## Features:
- ⚡ Direct GPU inference (no REST API overhead)
- 🧱 4-bit quantization (bitsandbytes)
- 🚀 Micro-batching for higher throughput
- 💨 FlashAttention2 / SDPA for optimized attention
- 🌐 Public URL via ngrok

---

## Requirements:
- GPU-enabled SageMaker instance (ml.g4dn.xlarge or better)
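- ~8 GB GPU memory minimum (with 4-bit quantization)
- ~16 GB disk space for the model cache (the 16-bit weights are downloaded in full; 4-bit quantization happens at load time)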
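
The requirements above can be checked before anything is downloaded. The following is a minimal sketch, not part of the original notebook: the helper names `free_disk_gb` and `gpu_memory_mib` are hypothetical, and it assumes `nvidia-smi` is on the PATH of a GPU instance (it returns an empty list otherwise).

```python
import shutil
import subprocess
from pathlib import Path


def free_disk_gb(path: str = str(Path.home())) -> float:
    """Return free disk space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def gpu_memory_mib() -> list:
    """Return total memory (MiB) per GPU as reported by nvidia-smi, or [] if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    # Keep only purely numeric lines so odd values such as "[N/A]" are skipped.
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


print(f"Free disk in home directory: {free_disk_gb():.1f} GiB")
print(f"GPU memory (MiB): {gpu_memory_mib() or 'no GPU found'}")
```

The Hugging Face cache lives under the home directory by default, which is why the disk check defaults to that path.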

## 1. Install Dependencies

In [None]:
%%time
!pip install -q streamlit pyngrok pandas pydantic pydantic-settings python-dotenv py-spy
!pip install -q torch transformers accelerate bitsandbytes einops

print("\n✅ All dependencies installed!")
print("   Streamlit, torch, transformers, accelerate, bitsandbytes, einops")
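
Because `pip install -q` hides most output, it can help to confirm what actually got installed. This is a small sketch using only the standard library; the function name `installed_versions` and the `REQUIRED` list are illustrative choices, not something the notebook defines.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


REQUIRED = ["streamlit", "pyngrok", "torch", "transformers", "accelerate", "bitsandbytes"]
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'MISSING'}")
```

Any line that prints `MISSING` points at an install that failed silently and is worth re-running without `-q`.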

## 2. Configure Ngrok Token

**Get your free token:** https://dashboard.ngrok.com/get-started/your-authtoken

In [None]:
# REPLACE THIS WITH YOUR ACTUAL NGROK TOKEN
NGROK_TOKEN = "YOUR_NGROK_TOKEN_HERE"

if NGROK_TOKEN == "YOUR_NGROK_TOKEN_HERE":
    print("❌ ERROR: Set your ngrok token above!")
    print("   Get it from: https://dashboard.ngrok.com/get-started/your-authtoken")
else:
    print("✅ Ngrok token configured")
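
Hardcoding the token in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched below under the assumption that the token is exported as `NGROK_AUTHTOKEN` (the variable the ngrok agent conventionally reads), is to prefer the environment and fall back to the hardcoded value. The helper `resolve_ngrok_token` is hypothetical.

```python
import os

PLACEHOLDER = "YOUR_NGROK_TOKEN_HERE"


def resolve_ngrok_token(hardcoded=PLACEHOLDER, env=None):
    """Prefer the NGROK_AUTHTOKEN environment variable; fall back to the hardcoded value."""
    env = os.environ if env is None else env
    token = env.get("NGROK_AUTHTOKEN", "").strip() or hardcoded
    if not token or token == PLACEHOLDER:
        raise ValueError("No ngrok token found: set NGROK_AUTHTOKEN or edit NGROK_TOKEN.")
    return token
```

In this notebook it could be used as `NGROK_TOKEN = resolve_ngrok_token(NGROK_TOKEN)`, which fails loudly instead of printing an error and carrying on.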

## 3. Pre-Download Qwen2.5-7B Model (First Run Only)

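This downloads and caches the model weights (~15 GB of 16-bit safetensors; bitsandbytes quantizes them to 4-bit at load time, not on disk). Subsequent runs skip the download and load from the local cache.

**Time:** 5-10 minutes on first run, depending on network speed; later runs only pay the load-and-quantize time.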

In [None]:
%%time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print("="*80)
print("📥 PRE-DOWNLOADING QWEN2.5-7B-INSTRUCT MODEL")
print("="*80)

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
USE_4BIT = True

print(f"\n📦 Model: {MODEL_NAME}")
print(f"🔧 4-bit quantization: {USE_4BIT}")
print("💾 Cache: ~/.cache/huggingface/hub/")
print(f"🖥️  Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

if torch.cuda.is_available():
    print(f"✅ GPU detected: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️  No GPU detected - model will run on CPU (slower)")

print("\n" + "="*80)
print("STEP 1: Downloading Tokenizer")
print("="*80)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
print("✅ Tokenizer downloaded (a few MB)")

print("\n" + "="*80)
print("STEP 2: Downloading Model Weights (~15 GB)")
print("="*80)
print("⏳ This will take 5-10 minutes on first run...")
print("   Subsequent runs reuse the cache (no download)\n")

if USE_4BIT:
    from transformers import BitsAndBytesConfig
    print("🧱 Loading with 4-bit quantization (bitsandbytes)")
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=quant_config,
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    print("✅ Model downloaded and quantized to 4-bit")
    print("   Memory usage: roughly 5-6 GB (embeddings and output head stay in 16-bit)")
else:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    print("✅ Model downloaded (FP16)")
    print("   Memory usage: ~15 GB")

print("\n" + "="*80)
print("STEP 3: Testing Generation")
print("="*80)

# Move the inputs to wherever device_map="auto" placed the model
inputs = tokenizer("Hello, I am", return_tensors="pt").to(model.device)

with torch.inference_mode():
    # do_sample=True makes the temperature setting take effect explicitly
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.7)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("✅ Test generation successful!")
print("   Input: 'Hello, I am'")
print(f"   Output: '{result}'")

# Clean up to free memory
import gc

del model
del tokenizer
gc.collect()  # drop lingering references before releasing cached GPU blocks
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\n" + "="*80)
print("✅ MODEL READY FOR EMBEDDED GPU INFERENCE!")
print("="*80)
print(f"✅ Model: {MODEL_NAME}")
print(f"✅ Quantization: {'4-bit (NF4)' if USE_4BIT else 'FP16'}")
print("✅ Cache: ~/.cache/huggingface/hub/")
print(f"✅ Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print("\nThe Streamlit app will load this cached model without re-downloading! 🚀")
print("="*80)
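
As a back-of-the-envelope check on the download and memory figures, weight size is roughly parameter count times bits per parameter. The sketch below assumes about 7.6 billion parameters for this model, which is an approximation; `weight_size_gb` is a hypothetical helper.

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.6e9  # approximate parameter count for a "7B" Qwen2.5 model (assumption)

for label, bits in [("FP16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_size_gb(N_PARAMS, bits):.1f} GB")
# prints:
#  FP16: ~15.2 GB
# 4-bit: ~3.8 GB
```

The 4-bit figure is a lower bound. Quantization typically leaves the embeddings, output head, and normalization layers in 16-bit, and activations plus the KV cache add to the total at inference time, so real GPU usage is noticeably higher.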

## 4. Setup SQLite Database

In [None]:
import sqlite3

# Create .env file
with open('.env', 'w') as f:
    f.write('''DATABASE_URL=sqlite:////tmp/contributor_intelligence.db

# Embedded GPU config
EMBED_MODEL=Qwen/Qwen2.5-7B-Instruct
EMBED_4BIT=1
INFER_CONCURRENCY=8
MICRO_BATCH_SIZE=32
BATCH_LATENCY_MS=120
''')

print("✅ Environment configured (SQLite + Embedded GPU)")

# Create SQLite database
db_path = '/tmp/contributor_intelligence.db'
conn = sqlite3.connect(db_path)
conn.execute('''
    CREATE TABLE IF NOT EXISTS contributors (
        email TEXT PRIMARY KEY,
        contributor_id TEXT UNIQUE NOT NULL,
        processed_data TEXT NOT NULL,
        intelligence_summary TEXT,
        processing_status TEXT DEFAULT 'pending',
        error_message TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        intelligence_extracted_at TIMESTAMP
    )
''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_status ON contributors(processing_status)')
conn.commit()
conn.close()

print(f"✅ SQLite database: {db_path}")
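
To illustrate how rows in the `contributors` table might move through processing, here is a self-contained sketch against an in-memory database with the same schema. The sample email, ID, payload, and the `'completed'` status value are all hypothetical; the notebook itself only defines the `'pending'` default.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for /tmp/contributor_intelligence.db
conn.execute("""
    CREATE TABLE contributors (
        email TEXT PRIMARY KEY,
        contributor_id TEXT UNIQUE NOT NULL,
        processed_data TEXT NOT NULL,
        intelligence_summary TEXT,
        processing_status TEXT DEFAULT 'pending',
        error_message TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        intelligence_extracted_at TIMESTAMP
    )
""")

# Queue a contributor; processing_status falls back to its 'pending' default.
conn.execute(
    "INSERT INTO contributors (email, contributor_id, processed_data) VALUES (?, ?, ?)",
    ("a@example.com", "c-001", json.dumps({"notes": "hypothetical payload"})),
)

# Record a finished extraction. updated_at is set by hand: DEFAULT only applies on INSERT.
conn.execute(
    """UPDATE contributors
       SET intelligence_summary = ?, processing_status = 'completed',
           updated_at = CURRENT_TIMESTAMP, intelligence_extracted_at = CURRENT_TIMESTAMP
       WHERE email = ?""",
    ("hypothetical summary", "a@example.com"),
)
conn.commit()

# Count rows per status; this is the query the idx_status index is meant to serve.
status_counts = dict(
    conn.execute("SELECT processing_status, COUNT(*) FROM contributors GROUP BY processing_status")
)
print(status_counts)  # {'completed': 1}
conn.close()
```

Note that `updated_at` does not refresh automatically in SQLite, so any code that updates a row has to set it explicitly or rely on a trigger.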

## 5. Configure Ngrok

In [None]:
from pyngrok import ngrok, conf

conf.get_default().auth_token = NGROK_TOKEN
print("✅ Ngrok configured")

## 6. Start Streamlit App

In [None]:
import subprocess
import time

# Stop any existing Streamlit
subprocess.run(['pkill', '-f', 'streamlit'], capture_output=True)
time.sleep(2)

# Start Streamlit
print("🎨 Starting Streamlit app with embedded GPU...")
with open('/tmp/streamlit.log', 'w') as log:
    streamlit_process = subprocess.Popen(
        ['streamlit', 'run', 'app.py', 
         '--server.port', '8501',
         '--server.headless', 'true'],
        stdout=log,
        stderr=log
    )

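time.sleep(8)
if streamlit_process.poll() is None:
    print(f"✅ Streamlit started (PID: {streamlit_process.pid})")
    print("⏳ Model loading in background (first request will take ~30s)...")
else:
    print(f"❌ Streamlit exited early (code {streamlit_process.returncode}); see /tmp/streamlit.log")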
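
A fixed `time.sleep(8)` is a guess at how long the server needs. A more robust option, sketched here with a hypothetical helper `wait_for_port`, is to poll the port until it accepts connections or a timeout expires.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; try again shortly
    return False


# Usage after launching Streamlit (port 8501 comes from the cell above):
# if not wait_for_port("127.0.0.1", 8501):
#     print(open("/tmp/streamlit.log").read())
```

An open port only shows that the web server is up. The model may still be loading, so the first request can remain slow.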

## 7. Create Public URL 🌐

In [None]:
try:
    ngrok.kill()
    time.sleep(2)
    tunnel = ngrok.connect(8501, bind_tls=True)
    public_url = tunnel.public_url  # plain URL string rather than the tunnel object's repr

    print("\n" + "="*80)
    print("🌐 PUBLIC URL READY!")
    print("="*80)
    print(f"\n🔗 {public_url}")
    print("\n" + "="*80)
    print("EMBEDDED GPU FEATURES:")
    print("="*80)
    print("✅ 4-bit quantization (bitsandbytes)")
    print("✅ Micro-batching (32 batch, 120ms latency)")
    print("✅ Semaphore concurrency (8 slots)")
    print("✅ FlashAttention2 / SDPA")
    print("✅ NO Ollama server needed!")
    print("\n⚠️  Keep the notebook kernel running; the tunnel closes when it stops.")
    print("="*80)

except Exception as e:
    print(f"❌ Error: {e}")
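
The banner above advertises micro-batching with batches of up to 32 requests and a 120 ms wait window. `app.py` is not shown in this notebook, so the following is only a sketch of how such a batcher could be built with `asyncio`, not the app's actual implementation. The class `MicroBatcher`, the `fake_generate` stand-in, and the smaller demo settings are all hypothetical, and the 8-slot semaphore is left out.

```python
import asyncio


class MicroBatcher:
    """Collect requests until the batch is full or the wait window expires, then run them together."""

    def __init__(self, run_batch, max_batch_size=32, max_wait_ms=120):
        self.run_batch = run_batch  # callable: list of prompts -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        """Enqueue one prompt and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def worker(self):
        """Run forever: drain the queue into batches and resolve each caller's future."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            prompts = [prompt for prompt, _ in batch]
            try:
                # Run the blocking batch call in a thread so the event loop stays responsive.
                results = await loop.run_in_executor(None, self.run_batch, prompts)
                for (_, fut), res in zip(batch, results):
                    if not fut.done():
                        fut.set_result(res)
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)


async def demo():
    batch_sizes = []

    def fake_generate(prompts):  # stand-in for a batched model.generate call
        batch_sizes.append(len(prompts))
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_generate, max_batch_size=4, max_wait_ms=50)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(f"req-{i}") for i in range(10)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results[:3], batch_sizes)
```

Inside a notebook cell, where an event loop is already running, `asyncio.run(demo())` raises an error; use `results, batch_sizes = await demo()` there instead. The trade-off to tune is the wait window: a longer window fills batches better under light load but adds up to that much latency to every request.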

## 8. Stop Services

In [None]:
import torch

ngrok.kill()
subprocess.run(['pkill', '-f', 'streamlit'], capture_output=True)
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✅ All services stopped")
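
`pkill -f streamlit` matches any process whose command line contains "streamlit", which may be broader than intended. When the `Popen` handle from the start cell is still available, the process can be stopped directly. This is a minimal sketch with a hypothetical helper `stop_process`, demonstrated on a throwaway child process rather than the real server.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask `proc` to terminate, force-kill it after `grace_seconds`, and return its exit code."""
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Demo with a throwaway child process standing in for `streamlit_process`.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print(f"child exited with code {exit_code}")
```

In this notebook the call would be `stop_process(streamlit_process)`, which only works in the same kernel session that started the app; after a kernel restart the handle is gone and `pkill` remains the fallback.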