# AppenCorrect vLLM Testing on SageMaker

**Instance Required:** ml.g5.xlarge (NVIDIA A10G GPU, 24GB VRAM)

**What This Does:**
1. Optionally installs Flash Attention 2 (2-3x speedup)
2. Installs FlashInfer and vLLM 0.6.3+ (better dependency management)
3. Starts vLLM server with Qwen 2.5 7B Instruct
4. Starts Flask API connected to vLLM
5. Creates ngrok tunnel for public access
6. Tests the complete system

**Time:** ~15 min first run (downloads 14GB model), then <2 min on restarts

## Step 1: Check GPU

In [None]:
!nvidia-smi

## Step 2: Install Flash Attention 2 (Optional)

Flash Attention 2 speeds up attention computation by roughly 2-3x. Skip this step if the build takes too long; vLLM works without it.

In [None]:
print('Installing Flash Attention 2 (5-10 min, provides 2-3x speedup)...')
print('⚠️  If this hangs, restart kernel and skip this step - vLLM works without it\n')

!pip install flash-attn --no-build-isolation

print('\n✅ Flash Attention 2 installed')
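
The cell above prints its success message even if `pip` failed, because `!pip` does not raise in a notebook. A minimal sketch of a follow-up check, assuming the distribution name `flash-attn` from the install command; `installed_version` is a hypothetical helper, not part of the repository:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# 'flash-attn' is the distribution name used in the pip command above.
version = installed_version('flash-attn')
if version:
    print(f'flash-attn {version} is installed')
else:
    print('flash-attn is not installed; vLLM will run without it')
```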

## Step 3: Install vLLM and Dependencies

**Uses vLLM 0.6.3+**, which has better dependency management and avoids the `pyairports` issue.

In [None]:
# Clean install - remove old packages that cause conflicts
print('🧹 Cleaning old packages...')
!pip uninstall vllm outlines pyairports -y 2>/dev/null || true

print('\n⚡ Installing FlashInfer (CRITICAL for 2-3x speed improvement)...')
# FlashInfer provides fast sampling operations for vLLM
!pip install flashinfer -U

print('\n📦 Installing vLLM 0.6.3+ and dependencies...')
# Install vLLM 0.6.3+ (better dependency management, no pyairports issues)
# The specifier is quoted: an unquoted >= is parsed by the shell as an output redirect
!pip install "vllm>=0.6.3" transformers torch pyngrok requests flask flask-cors python-dotenv jsonschema langdetect

# Verify installation
import subprocess
result = subprocess.run(['pip', 'show', 'vllm'], capture_output=True, text=True)
version_line = [line for line in result.stdout.split('\n') if 'Version:' in line]

flashinfer_result = subprocess.run(['pip', 'show', 'flashinfer'], capture_output=True, text=True)
flashinfer_version = [line for line in flashinfer_result.stdout.split('\n') if 'Version:' in line]

print(f'\n✅ All dependencies installed')
print(f'   {version_line[0] if version_line else "vLLM version: unknown"}')
print(f'   {flashinfer_version[0] if flashinfer_version else "⚠️  FlashInfer NOT installed (will be slower!)"}')
print('   No pyairports conflicts!')

## Step 4: Navigate to Repository

In [None]:
import os

# Navigate to your cloned repo
os.chdir('/home/sagemaker-user/appen-correct-localised')
!git checkout vllm
!git pull origin vllm

print(f'\n✅ Repository ready: {os.getcwd()}')

## Step 5: Start vLLM Server (Background)

**⚡ Configuration for 4096 Context Window:**
- **FlashInfer:** Enabled (2-3x faster sampling)
- **Flash Attention:** Enabled (2-3x faster attention)
- **max-model-len:** 4096 (handles ~11,900 chars / 2,000+ words)
- **gpu-memory-utilization:** 90% (~20-21GB used on the A10G)
- **max-num-seqs:** 8 (concurrent requests per GPU)
- **Prefix caching:** Enabled (speeds up repeated system prompts)
- **generation-config vllm:** **CRITICAL FIX** - Use vLLM's sampling config instead of model's creative defaults (prevents temperature=0.7 override)

**Performance & Capacity:**
- **Latency:** 3-6 seconds per request
- **Concurrent Requests:** 8 per GPU node
- **User Capacity:** ~80 active users per GPU (with typical usage patterns)
- **Max Text:** ~11,900 characters (~2,000+ words)
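
The capacity figures above can be related to each other with simple arithmetic. The sketch below uses only the numbers quoted in this section and assumes that all 8 slots stay busy and that latency does not change under load; both are simplifications, so treat the results as rough estimates rather than measurements.

```python
# Figures quoted in this section
MAX_MODEL_LEN = 4096          # tokens
MAX_CHARS = 11_900            # characters quoted for a full context
MAX_NUM_SEQS = 8              # concurrent requests per GPU
LATENCY_RANGE_S = (3.0, 6.0)  # seconds per request
ACTIVE_USERS = 80

# Characters per token implied by the "~11,900 chars" figure
chars_per_token = MAX_CHARS / MAX_MODEL_LEN
print(f'Implied chars per token: {chars_per_token:.2f}')


def sustained_throughput(slots, latency_s):
    """Requests per second when every slot is always busy."""
    return slots / latency_s


def seconds_between_requests(users, slots, latency_s):
    """How often each user can submit a request without queueing."""
    return users / sustained_throughput(slots, latency_s)


for latency in LATENCY_RANGE_S:
    rps = sustained_throughput(MAX_NUM_SEQS, latency)
    gap = seconds_between_requests(ACTIVE_USERS, MAX_NUM_SEQS, latency)
    print(f'latency {latency:.0f}s: {rps:.2f} req/s, '
          f'one request per user every {gap:.0f}s')
```

Under these assumptions, "~80 active users" corresponds to each user submitting a request roughly every 30 to 60 seconds.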

In [None]:
import subprocess, time, requests, os

# Set cache
cache = '/home/sagemaker-user/.huggingface'
os.makedirs(cache, exist_ok=True)
os.environ['HF_HOME'] = cache

# Kill existing
!pkill -f vllm.entrypoints || true
time.sleep(2)

print('🚀 Starting vLLM server with 4096 context window...')
print('⏳ First run: 5-10 min (downloads 14GB)')
print('⏳ Next runs: 30-60 sec (from cache)\n')
print('⚡ Configuration for A10G GPU (24GB VRAM):')
print('  - max-model-len: 4096 (handles ~11,900 chars / 2,000+ words)')
print('  - gpu-memory-utilization: 90%')
print('  - max-num-seqs: 8 (concurrent requests per GPU)')
print('  - FlashInfer: Enabled (2-3x faster sampling)')
print('  - Flash Attention: Enabled (2-3x faster)')
print('  - generation-config vllm: CRITICAL FIX for JSON output')
print('\n💡 Expected: 3-6s per request | Supports ~80 active users per GPU\n')

log_file = open('/tmp/vllm.log', 'w')
vllm_process = subprocess.Popen([
    'python', '-m', 'vllm.entrypoints.openai.api_server',
    '--model', 'Qwen/Qwen2.5-7B-Instruct',
    '--host', '0.0.0.0', '--port', '8000',
    '--dtype', 'auto',
    '--max-model-len', '4096',  # UPGRADED: Handles up to 11,900 chars (~2,000 words)
    '--gpu-memory-utilization', '0.90',  # 90% utilization for optimal performance
    '--max-num-seqs', '8',  # REDUCED: 8 concurrent requests (larger KV cache per request)
    '--enable-prefix-caching',
    '--trust-remote-code',
    '--generation-config', 'vllm',  # CRITICAL: Use vLLM's config, not model's creative defaults
    '--disable-log-requests'
], stdout=log_file, stderr=subprocess.STDOUT)

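# Poll the health endpoint for up to 10 minutes (120 attempts x 5s)
for i in range(120):
    try:
        if requests.get('http://localhost:8000/health', timeout=2).status_code == 200:
            print(f'\n✅ vLLM ready after {i*5}s!')
            break
    except requests.RequestException:
        pass  # server is not accepting connections yet
    if vllm_process.poll() is not None:
        raise RuntimeError('vLLM exited during startup; check /tmp/vllm.log')
    if i % 6 == 0: print(f'  Loading... ({i*5}s)')
    time.sleep(5)
else:
    raise RuntimeError('vLLM not ready after 10 min; check /tmp/vllm.log')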

print('✅ vLLM server running at http://localhost:8000')
print('📊 Concurrent requests: 8 (per GPU node)')
print('📏 Max sequence length: 4096 tokens (~11,900 chars)')
print('💾 GPU Memory: 90% utilization (~20-21GB used)')
print('👥 User capacity: ~80 active users per GPU (with 3-6s latency)')
print('\n📝 Check logs: !tail -50 /tmp/vllm.log')
print('🔍 Check FlashInfer: !grep -i flashinfer /tmp/vllm.log')
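
The `tail` and `grep` hints printed above can also be run from Python. This is a minimal sketch; `find_log_lines` is a hypothetical helper, and the keywords are only search terms, not a claim about the exact wording vLLM writes to its log.

```python
def find_log_lines(path, keywords, limit=20):
    """Return up to `limit` lines from `path` that contain any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    matches = []
    with open(path, 'r', errors='replace') as fh:
        for line in fh:
            lowered = line.lower()
            if any(k in lowered for k in wanted):
                matches.append(line.rstrip('\n'))
                if len(matches) >= limit:
                    break
    return matches
```

For example, `find_log_lines('/tmp/vllm.log', ['flashinfer', 'error'])` returns the first matching lines from the server log written by the cell above.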

## Step 6: Test vLLM Directly

In [None]:
import requests, json

r = requests.post('http://localhost:8000/v1/completions', json={
    'model': 'Qwen/Qwen2.5-7B-Instruct',
    'prompt': 'Fix: I has a eror',
    'max_tokens': 100, 'temperature': 0.2
}, timeout=30)

print('✅ vLLM inference test:')
print(r.json()['choices'][0]['text'])
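
The test above sends a raw prompt to `/v1/completions`. Since Qwen 2.5 7B Instruct is a chat-tuned model, a request to the OpenAI-compatible `/v1/chat/completions` route may give a more representative result. The sketch below assumes the server from Step 5 is running on port 8000; the system prompt wording and both helper names are illustrative, not taken from the repository.

```python
def build_chat_payload(text, model='Qwen/Qwen2.5-7B-Instruct'):
    """Build a chat-completions request body for a grammar-fix prompt."""
    return {
        'model': model,
        'messages': [
            # Illustrative system prompt; the Flask app may use a different one.
            {'role': 'system',
             'content': 'Correct the grammar and spelling. Reply with the corrected sentence only.'},
            {'role': 'user', 'content': text},
        ],
        'max_tokens': 100,
        'temperature': 0.2,
    }


def chat_fix(text, base_url='http://localhost:8000'):
    """Send the payload to the vLLM server and return the reply text."""
    import requests  # same HTTP client as the rest of the notebook

    r = requests.post(f'{base_url}/v1/chat/completions',
                      json=build_chat_payload(text), timeout=30)
    r.raise_for_status()
    return r.json()['choices'][0]['message']['content']
```

With the server up, `print(chat_fix('I has a eror'))` prints the model's reply.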

## Step 7: Start Flask API

In [None]:
import subprocess, time, requests, os

os.chdir('/home/sagemaker-user/appen-correct-localised')
os.environ['VLLM_URL'] = 'http://localhost:8000'

!pkill -f 'python.*app.py' || true
time.sleep(2)

print('🚀 Starting Flask API...')
log_file = open('/tmp/flask.log', 'w')
flask_process = subprocess.Popen(
    ['python3', 'app.py'],
    stdout=log_file,
    stderr=subprocess.STDOUT
)

time.sleep(5)
for i in range(10):
    try:
        r = requests.get('http://localhost:5006/health', timeout=2)
        if r.status_code == 200:
            print('\n✅ Flask API ready!')
            print(r.json())
            break
    except requests.RequestException:
        pass  # Flask is not accepting connections yet
    time.sleep(2)
else:
    print('❌ Flask failed to start. Check logs:')
    print('!tail -50 /tmp/flask.log')

print('\nFlask running at http://localhost:5006')
print('📝 Check logs: !tail -50 /tmp/flask.log')

## Step 8: Test Complete System

In [None]:
r = requests.post('http://localhost:5006/demo/check', json={
    'text': 'I has a eror in grammer'
}, timeout=30)

print(f'Status Code: {r.status_code}')
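# Fall back to the raw body if an error response is not valid JSON
try:
    result = r.json()
except ValueError:
    result = {'error': r.text}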

if r.status_code == 200:
    print('✅ Grammar check test:\n')
    print(f'Original:  {result.get("original_text", "N/A")}')
    print(f'Corrected: {result.get("processed_text", "N/A")}')
    print(f'\nCorrections: {len(result.get("corrections", []))}')
    for i, c in enumerate(result.get('corrections', [])[:5], 1):
        print(f'  {i}. {c["type"]}: "{c["original"]}" → "{c["suggestion"]}"')
    
    stats = result.get('statistics', {})
    print(f'\nProcessing time: {stats.get("processing_time", "N/A")}')
    print(f'API type: {stats.get("api_type", "N/A")}')
else:
    print(f'❌ Error: {result}')
    print('\nCheck Flask logs:')
    print('!tail -100 /tmp/flask.log')
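
A single request does not exercise the 8-way concurrency configured in Step 5. The following sketch sends a small burst in parallel and records per-request latency. `timed_call`, `run_burst`, and `check_text` are hypothetical helpers; the endpoint and payload match the cell above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send, text):
    """Call send(text) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    send(text)
    return time.perf_counter() - start


def run_burst(send, texts, workers=8):
    """Send all texts concurrently and return one latency per text, in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: timed_call(send, t), texts))


def check_text(text):
    """POST one text to the Flask grammar-check endpoint."""
    import requests  # same HTTP client as the rest of the notebook

    r = requests.post('http://localhost:5006/demo/check',
                      json={'text': text}, timeout=60)
    r.raise_for_status()
    return r.json()
```

With both services running, `latencies = run_burst(check_text, ['I has a eror in grammer'] * 8)` followed by `print(min(latencies), max(latencies))` gives a rough view of how latency holds up when all 8 slots are in use.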

## Step 9: Create ngrok Tunnel

**Get token:** https://dashboard.ngrok.com/get-started/your-authtoken

In [None]:
from pyngrok import ngrok, conf

NGROK_TOKEN = 'YOUR_TOKEN_HERE'  # ← CHANGE THIS!

if NGROK_TOKEN == 'YOUR_TOKEN_HERE':
    print('⚠️  Set your ngrok token above!')
    print('Get it: https://dashboard.ngrok.com/get-started/your-authtoken')
else:
    ngrok.kill()
    conf.get_default().auth_token = NGROK_TOKEN
    url = ngrok.connect(5006).public_url  # connect() returns a tunnel object, not a string
    
    print('='*70)
    print('🚀 PUBLIC URL:')
    print('='*70)
    print(f'\n{url}\n')
    print(f'Demo:   {url}/')
    print(f'Health: {url}/health')
    print(f'API:    {url}/demo/check')
    print('\n' + '='*70)
    print('\nShare this URL to test from anywhere!')
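
To avoid pasting the token into the notebook, it can be read from the environment instead. This is a minimal sketch; `resolve_ngrok_token` is a hypothetical helper, and the variable name `NGROK_AUTHTOKEN` is an assumed convention that you would need to set yourself before starting the kernel.

```python
import os

PLACEHOLDER = 'YOUR_TOKEN_HERE'


def resolve_ngrok_token(env=os.environ, fallback=PLACEHOLDER):
    """Return the token from NGROK_AUTHTOKEN if set, otherwise the fallback."""
    token = env.get('NGROK_AUTHTOKEN', '').strip()
    return token or fallback


NGROK_TOKEN = resolve_ngrok_token()
print('Token source:',
      'environment' if NGROK_TOKEN != PLACEHOLDER else 'placeholder (not set)')
```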

## Step 10: Get curl Command

In [None]:
tunnels = ngrok.get_tunnels()
if tunnels:
    url = tunnels[0].public_url
    print('Copy this curl command:\n')
    print(f'curl -X POST {url}/demo/check \\')
    print('  -H "Content-Type: application/json" \\')
    print('  -d \'{"text": "I has a eror in grammer"}\'\n')
    print(f'Or open: {url}/')
else:
    print('Run ngrok cell first!')

## Monitor GPU

In [None]:
!nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
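
The CSV output of the command above can be parsed to compare actual memory use against the 90% target from Step 5. This is a sketch: `parse_gpu_csv` is a hypothetical helper, and the sample text is illustrative input in the format `nvidia-smi` uses, not a measurement from this setup.

```python
def parse_gpu_csv(output):
    """Parse `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv`."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # the first line is the CSV header
        util, used, total = [field.strip() for field in line.split(',')]
        used_mib = int(used.split()[0])    # e.g. "20480 MiB" -> 20480
        total_mib = int(total.split()[0])
        rows.append({
            'util_pct': int(util.split()[0]),  # e.g. "35 %" -> 35
            'used_mib': used_mib,
            'total_mib': total_mib,
            'used_fraction': used_mib / total_mib,
        })
    return rows


# Illustrative sample only, not real output from this notebook
SAMPLE = """utilization.gpu [%], memory.used [MiB], memory.total [MiB]
35 %, 20480 MiB, 23028 MiB
"""

for gpu in parse_gpu_csv(SAMPLE):
    print(f"GPU util {gpu['util_pct']}%, memory {gpu['used_fraction']:.0%} used")
```

To use live data, pass the result of `subprocess.check_output([...], text=True)` with the same `nvidia-smi` arguments instead of `SAMPLE`.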

## Cleanup

In [None]:
!pkill -f 'python.*app.py' || true
!pkill -f vllm.entrypoints || true
from pyngrok import ngrok
ngrok.kill()
print('✅ Stopped all services')

---

## ✅ Done!

- **vLLM:** 0.6.3+ (better dependency management, no pyairports issues)
- **Model:** Qwen 2.5 7B Instruct on A10G GPU (90% memory utilization)
- **Flask API:** Connected to vLLM
- **ngrok:** Public access to full UI
- **Concurrency:** 8 concurrent requests (testing)
- **Max tokens:** 4096 (grammar checking with context)
- **Model cache:** Persistent (no re-download)
- **Cost:** $0.75/hr vs $3k-5k/month Gemini API

### Production Settings (EKS):
For 500+ users, scale to multiple GPUs:
- **vLLM:** 0.6.3+ (stable, production-ready)
- **GPU Memory:** 85% utilization
- **Concurrent requests:** 32-64 per GPU
- **Max tokens:** 4096
- **Auto-scaling:** KEDA (pods) + Karpenter (GPU nodes)
- **Spot instances:** 60-70% cost savings
- **Flash Attention 2:** Recommended for production