# AppenCorrect TGI Testing on SageMaker

**Instance Required:** ml.g5.xlarge or ml.g6.xlarge (NVIDIA GPU, 24GB VRAM)

This notebook:
1. Installs the TGI (Text Generation Inference) client and dependencies
2. Starts TGI server with Qwen 2.5 7B
3. Starts Flask API
4. Creates ngrok tunnel for public access
5. Tests the complete system

**Time:** ~5 min on the first run (downloads the 14GB model), then ~1 min on later runs (model cached)

## Step 1: Check GPU

In [None]:
!nvidia-smi

## Step 2: Install TGI Client and Dependencies

In [None]:
!pip install text-generation aiohttp transformers torch pyngrok requests flask flask-cors python-dotenv jsonschema langdetect
print('\n✅ All dependencies installed')
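
The cell above prints its success message even if `pip` fails partway through, so a quick import check can confirm that the packages actually resolved. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list holds the usual import names for the packages installed above.

```python
from importlib.util import find_spec

# Import names for the packages installed above (several differ from the pip names).
REQUIRED = [
    'text_generation', 'aiohttp', 'transformers', 'torch', 'pyngrok',
    'requests', 'flask', 'flask_cors', 'dotenv', 'jsonschema', 'langdetect',
]

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print(f'❌ Missing: {", ".join(missing)}')
else:
    print('✅ All dependencies importable')
```

`find_spec` only locates a module without importing it, so the check stays fast even for large packages such as `torch`.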

## Step 3: Check Out Repository Branch

In [None]:
import os

# Navigate to your cloned repo
os.chdir('/home/sagemaker-user/appen-correct-localised')
!git checkout tgi
!git pull origin tgi

print(f'\n✅ Repository ready: {os.getcwd()}')

## Step 4: Start TGI Server (Docker)

**Note:** TGI runs best in Docker. We'll use the official Hugging Face container.

In [None]:
import subprocess, time, requests, os

# Set cache
cache = '/home/sagemaker-user/.huggingface'
os.makedirs(cache, exist_ok=True)
os.environ['HF_HOME'] = cache

# Kill existing
!docker stop tgi-server 2>/dev/null || true
!docker rm tgi-server 2>/dev/null || true
time.sleep(2)

print('🚀 Starting TGI server with Docker...')
print('⏳ First run: 3-5 min (downloads 14GB)')
print('⏳ Next runs: 30-60 sec (from cache)\n')
print('Memory Settings: Optimized for 24GB VRAM GPU')
print('  - Model: ~14GB')
print('  - KV Cache: ~8GB')
print('  - Reserved: ~2GB\n')

# Start TGI in Docker (background).
# TGI's --dtype flag accepts float16 or bfloat16.
!docker run -d \
    --name tgi-server \
    --gpus all \
    -p 8080:80 \
    -v {cache}:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2.5-7B-Instruct \
    --max-concurrent-requests 8 \
    --max-input-length 512 \
    --max-total-tokens 1536 \
    --dtype bfloat16

# Wait for TGI to be ready (up to 60 attempts, 5s apart)
print('Waiting for TGI server...')
for i in range(60):
    try:
        r = requests.get('http://localhost:8080/health', timeout=2)
        if r.status_code == 200:
            print(f'\n✅ TGI ready after {i*5}s!')
            print('✅ TGI server running at http://localhost:8080')
            break
    except requests.RequestException:
        pass  # Server is not accepting connections yet
    # Sleep on every failed attempt, including non-200 responses
    if i % 3 == 0: print(f'  Loading... ({i*5}s)')
    time.sleep(5)
else:
    print('❌ Timeout waiting for TGI. Check logs: !docker logs tgi-server')

print('📊 Concurrent requests: 8 (enough for testing)')
print('📏 Max tokens: 1536 (512 input + 1024 output)')
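
The memory figures printed in this step can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the parameter count and attention layout for Qwen2.5-7B-Instruct are approximate values that are not given in this notebook and should be verified against the model's `config.json`. The concurrency and token limits come from the `docker run` flags.

```python
GIB = 1024 ** 3

# Assumptions (not stated in this notebook): approximate parameter count and
# attention layout for Qwen2.5-7B-Instruct. Verify against the model's config.json.
PARAMS = 7.6e9          # ~7.6 billion parameters
BYTES_PER_VALUE = 2     # 16-bit weights and KV cache entries
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128

# Values taken from the docker run flags in this step.
MAX_CONCURRENT = 8
MAX_TOTAL_TOKENS = 1536

def weights_gib(params=PARAMS, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed to hold the model weights, in GiB."""
    return params * bytes_per_value / GIB

def kv_bytes_per_token():
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def kv_gib_at_full_load():
    """KV cache needed if every concurrent request uses its full token budget."""
    return MAX_CONCURRENT * MAX_TOTAL_TOKENS * kv_bytes_per_token() / GIB

print(f'Weights:  ~{weights_gib():.1f} GiB')
print(f'KV cache: ~{kv_gib_at_full_load():.2f} GiB for 8 x 1536 tokens')
```

Under these assumptions the configured limits need far less than the ~8GB KV cache budget printed above, which suggests headroom for a higher `--max-concurrent-requests` value. How much memory TGI actually reserves is decided by the server at startup.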

## Step 5: Test TGI Directly

In [None]:
import requests

r = requests.post('http://localhost:8080/generate', json={
    'inputs': 'Fix: I has a eror',
    'parameters': {'max_new_tokens': 100, 'temperature': 0.2}
}, timeout=30)

r.raise_for_status()  # Surface HTTP errors instead of a confusing KeyError below
print('✅ TGI inference test:')
print(r.json()['generated_text'])
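
The server in Step 4 was started with `--max-input-length 512` and `--max-total-tokens 1536`, so a request that asks for too many new tokens can be rejected. A minimal sketch of a payload builder that caps `max_new_tokens` at 1024, the amount that remains valid even for a prompt at the maximum input length, might look like this. The helper name `build_generate_payload` is hypothetical, and the cap is deliberately conservative.

```python
MAX_INPUT_LENGTH = 512    # --max-input-length from Step 4
MAX_TOTAL_TOKENS = 1536   # --max-total-tokens from Step 4

def build_generate_payload(prompt, max_new_tokens=100, temperature=0.2):
    """Build a /generate request body, capping output at the server's token budget."""
    if max_new_tokens < 1:
        raise ValueError('max_new_tokens must be at least 1')
    # Output budget that is safe even when the prompt uses all 512 input tokens.
    limit = MAX_TOTAL_TOKENS - MAX_INPUT_LENGTH
    return {
        'inputs': prompt,
        'parameters': {
            'max_new_tokens': min(max_new_tokens, limit),
            'temperature': temperature,
        },
    }

payload = build_generate_payload('Fix: I has a eror', max_new_tokens=5000)
print(payload['parameters']['max_new_tokens'])  # 1024
```

The returned dict can be passed as `json=payload` to the same `requests.post` call used above.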

## Step 6: Start Flask API

In [None]:
import subprocess, time, requests, os

os.chdir('/home/sagemaker-user/appen-correct-localised')
os.environ['TGI_URL'] = 'http://localhost:8080'

!pkill -f 'python.*app.py' || true
time.sleep(2)

print('🚀 Starting Flask API...')
flask_process = subprocess.Popen(
    ['python3', 'app.py'],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)

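time.sleep(5)
for i in range(10):
    try:
        r = requests.get('http://localhost:5006/health', timeout=2)
        if r.status_code == 200:
            print('\n✅ Flask API ready!')
            print(r.json())
            print('Flask running at http://localhost:5006')
            break
    except requests.RequestException:
        pass  # Not accepting connections yet
    time.sleep(2)
else:
    # Output is sent to DEVNULL above, so run app.py in a terminal to see its errors
    print('❌ Flask API did not become healthy. Run `python3 app.py` in a terminal to debug.')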
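
Steps 4 and 6 both poll a `/health` endpoint until it answers, so the loop could be factored into one helper. The following is a sketch rather than code from the repository: `wait_until_ready` and `http_ok` are hypothetical names, and the probe is passed in as a callable so the retry logic can be exercised without a running server.

```python
import time

def wait_until_ready(probe, attempts=10, delay=2.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number, or None on timeout."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return attempt
        except Exception:
            pass  # Treat any probe error as "not ready yet"
        if attempt < attempts:
            sleep(delay)
    return None

def http_ok(url):
    """Probe factory: the probe is True when a GET to url returns HTTP 200."""
    import requests  # Imported lazily so the retry helper has no hard dependency

    def probe():
        return requests.get(url, timeout=2).status_code == 200
    return probe

# Usage (requires the servers from the cells above):
# wait_until_ready(http_ok('http://localhost:8080/health'), attempts=60, delay=5)
# wait_until_ready(http_ok('http://localhost:5006/health'))
```

Returning `None` on timeout lets the caller decide whether to print a hint or stop the notebook.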

## Step 7: Test Complete System

In [None]:
r = requests.post('http://localhost:5006/demo/check', json={
    'text': 'I has a eror in grammer'
}, timeout=30)

result = r.json()
print('✅ Grammar check test:\n')
print(f'Original:  {result["original_text"]}')
print(f'Corrected: {result["corrected_text"]}')
print(f'\nErrors: {len(result["errors"])}')
for e in result['errors'][:3]:
    print(f'  {e["original"]} → {e["suggestion"]} ({e["type"]})')
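
The cell above indexes the response directly, so a missing key raises `KeyError` if the API returns an error body. A slightly more defensive formatter, sketched here under the assumption that the response carries the same fields the cell reads (`original_text`, `corrected_text`, `errors`), could look like this. The function name and the sample data are made up for illustration.

```python
def format_check_result(result, max_errors=3):
    """Render a /demo/check response as text, tolerating missing keys."""
    errors = result.get('errors') or []
    lines = [
        f"Original:  {result.get('original_text', '')}",
        f"Corrected: {result.get('corrected_text', '')}",
        f"Errors: {len(errors)}",
    ]
    for e in errors[:max_errors]:
        lines.append(
            f"  {e.get('original', '?')} → {e.get('suggestion', '?')} ({e.get('type', 'unknown')})"
        )
    return '\n'.join(lines)

# Made-up sample shaped like the fields read above; not real API output.
sample = {
    'original_text': 'I has a eror',
    'corrected_text': 'I have an error',
    'errors': [
        {'original': 'has', 'suggestion': 'have', 'type': 'grammar'},
        {'original': 'eror', 'suggestion': 'error'},
    ],
}
print(format_check_result(sample))
```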

## Step 8: Create ngrok Tunnel

**Get token:** https://dashboard.ngrok.com/get-started/your-authtoken

In [None]:
from pyngrok import ngrok, conf

NGROK_TOKEN = 'YOUR_TOKEN_HERE'  # ← CHANGE THIS!

if NGROK_TOKEN == 'YOUR_TOKEN_HERE':
    print('⚠️  Set your ngrok token above!')
    print('Get it: https://dashboard.ngrok.com/get-started/your-authtoken')
else:
    ngrok.kill()
    conf.get_default().auth_token = NGROK_TOKEN
    url = ngrok.connect(5006).public_url  # connect() returns an NgrokTunnel object
    
    print('='*70)
    print('🚀 PUBLIC URL:')
    print('='*70)
    print(f'\n{url}\n')
    print(f'Demo:   {url}/')
    print(f'Health: {url}/health')
    print(f'API:    {url}/demo/check')
    print('\n' + '='*70)
    print('\nShare this URL to test from anywhere!')
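
Hard-coding the token in a notebook cell makes it easy to commit by accident. One alternative, sketched below, reads it from an environment variable instead. The helper name `load_ngrok_token` and the choice of `NGROK_AUTHTOKEN` as the variable name are assumptions made for this example.

```python
import os

PLACEHOLDER = 'YOUR_TOKEN_HERE'

def load_ngrok_token(env=None, var='NGROK_AUTHTOKEN'):
    """Return the token from the environment, or None if unset or still the placeholder."""
    env = os.environ if env is None else env
    token = (env.get(var) or '').strip()
    if not token or token == PLACEHOLDER:
        return None
    return token

NGROK_TOKEN = load_ngrok_token()
if NGROK_TOKEN is None:
    print('⚠️  Set NGROK_AUTHTOKEN before running the tunnel cell')
```

Since python-dotenv is installed in Step 2, calling `load_dotenv()` before this helper would also pick the value up from a local `.env` file.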

## Step 9: Get curl Command

In [None]:
tunnels = ngrok.get_tunnels()
if tunnels:
    url = tunnels[0].public_url
    print('Copy this curl command:\n')
    print(f'curl -X POST {url}/demo/check \\')
    print('  -H "Content-Type: application/json" \\')
    print('  -d \'{"text": "I has a eror in grammer"}\'\n')
    print(f'Or open: {url}/')
else:
    print('Run ngrok cell first!')
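
The hand-written command above breaks if the test sentence contains a single quote. A sketch that builds the same command with `json.dumps` and `shlex.quote`, so arbitrary text is escaped for a POSIX shell, is shown below. `build_curl` is a hypothetical helper and the URL is a placeholder.

```python
import json
import shlex

def build_curl(base_url, text):
    """Return a copy-pasteable curl command for POST {base_url}/demo/check."""
    body = json.dumps({'text': text})
    endpoint = base_url.rstrip('/') + '/demo/check'
    # Join the parts with a backslash line continuation for readability.
    return ' \\\n  '.join([
        f'curl -X POST {shlex.quote(endpoint)}',
        '-H "Content-Type: application/json"',
        f'-d {shlex.quote(body)}',
    ])

# Placeholder URL for illustration; use tunnels[0].public_url in the notebook.
print(build_curl('https://example.ngrok-free.app', "It's a eror"))
```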

## Monitor GPU

In [None]:
!nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
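
To act on these numbers in code rather than read them by eye, the CSV text can be parsed into dictionaries. The sketch below assumes the header-plus-rows layout that `--format=csv` prints for the three queried fields; `parse_gpu_csv` is a hypothetical helper and the sample values are made up.

```python
import csv
import io

def parse_gpu_csv(output):
    """Parse `nvidia-smi --query-gpu=... --format=csv` text into one dict per GPU."""
    reader = csv.reader(io.StringIO(output.strip()), skipinitialspace=True)
    rows = [row for row in reader if row]
    gpus = []
    for util, used, total in rows[1:]:  # rows[0] is the header line
        gpus.append({
            'util_pct': int(util.split()[0]),    # "35 %" -> 35
            'used_mib': int(used.split()[0]),    # "15820 MiB" -> 15820
            'total_mib': int(total.split()[0]),
        })
    return gpus

# Illustrative text in the format nvidia-smi prints; the numbers are made up.
sample = """utilization.gpu [%], memory.used [MiB], memory.total [MiB]
35 %, 15820 MiB, 23028 MiB
"""
for gpu in parse_gpu_csv(sample):
    print(f"{gpu['used_mib'] / gpu['total_mib']:.0%} of VRAM in use")
```

In the notebook, the input text could come from `subprocess.check_output([...], text=True)` with the same arguments as the cell above.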

## Cleanup

In [None]:
!pkill -f 'python.*app.py' || true
!docker stop tgi-server || true
from pyngrok import ngrok
ngrok.kill()
print('✅ Stopped all services')

---

## ✅ Done!

- **TGI server:** Qwen 2.5 7B on GPU (Docker)
- **Flask API:** Connected to TGI
- **ngrok:** Public access to full UI
- **Concurrency:** 8 concurrent requests
- **Model:** Cached (no re-download)
- **Cost:** $0.75/hr vs $3k-5k/month Gemini API
- **NO pyairports issues!** ✨