# Module 22: NLP Model Deployment

**From Prototype to Production**

---

## 1. Objectives

- ✅ Build REST APIs with FastAPI
- ✅ Optimize models for inference
- ✅ Containerize with Docker
- ✅ Understand deployment patterns

## 2. Prerequisites

- [Module 21: RAG](../21_rag/21_rag.ipynb)

## 3. Deployment Overview

### Deployment Stack

```
┌─────────────────────────────────────┐
│           Load Balancer             │
├─────────────────────────────────────┤
│    FastAPI / Flask / Gradio         │
├─────────────────────────────────────┤
│    Model (PyTorch / ONNX)           │
├─────────────────────────────────────┤
│    Docker Container                 │
├─────────────────────────────────────┤
│    Cloud (AWS/GCP/Azure)            │
└─────────────────────────────────────┘
```

### Framework Comparison

| Framework | Best For | Relative Latency |
|-----------|----------|--------|
| FastAPI | Production APIs | Low |
| Flask | Simple apps | Medium |
| Gradio | Demos/Prototypes | Medium |
| Streamlit | Dashboards | Higher |

In [1]:
# Install: pip install fastapi uvicorn python-multipart

import torch
from transformers import pipeline



## 4. FastAPI Basics

In [2]:
# app.py - Basic FastAPI structure

fastapi_code = '''
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="NLP API", version="1.0")

# Load model at startup
classifier = None

@app.on_event("startup")
async def load_model():
    global classifier
    classifier = pipeline("sentiment-analysis")

class TextRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: TextRequest):
    result = classifier(request.text)[0]
    return PredictionResponse(
        label=result["label"],
        score=result["score"]
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}
'''

print(fastapi_code)





In [3]:
# Run with: uvicorn app:app --reload
# Access docs at: http://localhost:8000/docs

# Test with curl:
# curl -X POST "http://localhost:8000/predict" \
#      -H "Content-Type: application/json" \
#      -d '{"text": "This is amazing!"}'

print("API endpoints:")
print("  POST /predict - Classify text")
print("  GET /health - Health check")
print("  GET /docs - Swagger UI")

API endpoints:
  POST /predict - Classify text
  GET /health - Health check
  GET /docs - Swagger UI
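The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_predict_request` and `predict` are hypothetical, and the base URL assumes the local uvicorn address used in this section.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn address from this section


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body. Requires the server to be running."""
    with urllib.request.urlopen(build_predict_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without touching the network.
req = build_predict_request("This is amazing!")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Calling `predict("This is amazing!")` while the API is running should return a dictionary with the `label` and `score` fields defined by `PredictionResponse`.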


## 5. Model Optimization

### Optimization Techniques

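| Technique | Typical Speedup | Memory Reduction | Quality |
|-----------|---------|--------|--------|
| FP16 | ~2x (GPU) | ~50% | ~Same |
| Dynamic Quant (INT8) | ~2-4x (CPU) | ~75% | Slight ↓ |
| ONNX Runtime | ~2-3x | None | Same |
| TorchScript | ~1.5x | None | Same |

These figures are rough rules of thumb; actual gains depend on the model, batch size, and hardware.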
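To make the INT8 row concrete, here is a simplified sketch of what quantization does to a weight tensor: each float is mapped to an 8-bit integer plus one shared scale. This is a symmetric per-tensor scheme written in plain Python for illustration; it is not PyTorch's exact implementation, and the function names and sample weights are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"max round-trip error: {max_error:.5f} (bound: {scale / 2:.5f})")
```

Each weight now needs 1 byte instead of 4, which is where the roughly 75% memory reduction comes from. The price is a rounding error of at most half the scale per weight, which is the source of the slight quality drop.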

In [4]:
# Dynamic Quantization
from torch.quantization import quantize_dynamic
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(512, 256)
        self.linear2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = SimpleModel()

# Quantize
quantized_model = quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8
)

# Size comparison
import os
torch.save(model.state_dict(), "model.pt")
torch.save(quantized_model.state_dict(), "model_quant.pt")
print(f"Original: {os.path.getsize('model.pt') / 1024:.1f} KB")
print(f"Quantized: {os.path.getsize('model_quant.pt') / 1024:.1f} KB")

Original: 525.2 KB
Quantized: 135.3 KB
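The reported sizes can be sanity-checked with a little arithmetic. The sketch below counts the parameters of `SimpleModel` and estimates the file sizes, assuming INT8 weights take 1 byte each while biases stay in 4-byte FP32. The remaining gap to the measured values is presumably serialization overhead (metadata, scales, zero points).

```python
def linear_param_counts(in_features: int, out_features: int) -> tuple:
    """Return (weight_count, bias_count) for one nn.Linear layer."""
    return in_features * out_features, out_features


layers = [(512, 256), (256, 10)]  # the two layers of SimpleModel
n_weights = sum(linear_param_counts(i, o)[0] for i, o in layers)
n_biases = sum(linear_param_counts(i, o)[1] for i, o in layers)

# FP32: every parameter takes 4 bytes.
fp32_kb = (n_weights + n_biases) * 4 / 1024
# Assumption: weights are stored as 1-byte INT8, biases remain 4-byte FP32.
int8_kb = (n_weights * 1 + n_biases * 4) / 1024

print(f"parameters: {n_weights + n_biases}")      # 133898
print(f"FP32 estimate: {fp32_kb:.1f} KB")         # 523.0 KB
print(f"INT8 estimate: {int8_kb:.1f} KB")         # 131.5 KB
```

Both estimates land within a few KB of the measured 525.2 KB and 135.3 KB, which is consistent with the roughly 4x reduction expected from INT8 weights.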


**Note:** Recent PyTorch releases emit a deprecation warning for this call. Eager-mode quantization (`torch.ao.quantization.quantize`, `torch.ao.quantization.quantize_dynamic`) is being migrated to the torchao eager-mode `quantize_` API, and FX graph mode quantization to the torchao pt2e API (`prepare_pt2e`, `convert_pt2e`). See https://github.com/pytorch/ao/issues/2259 for details.


In [5]:
# TorchScript Export

model.eval()
example_input = torch.randn(1, 512)

# Trace the model
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# Load and use
loaded = torch.jit.load("model_traced.pt")
output = loaded(example_input)
print(f"TorchScript output shape: {output.shape}")

TorchScript output shape: torch.Size([1, 10])


## 6. ONNX Export

In [10]:
# Install missing dependencies
!pip install onnx onnxscript

import torch

# Export to ONNX
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("Model exported to ONNX!")

# For Hugging Face models, the `transformers.onnx` module is no longer
# maintained; the Optimum library is the recommended export path, e.g.:
# optimum-cli export onnx --model <model_id> <output_dir>



Model exported to ONNX!


In [11]:
# ONNX Runtime Inference
!pip install onnxruntime

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

# Run inference
result = session.run(
    None,
    {input_name: example_input.numpy()}
)
print(f"ONNX output shape: {result[0].shape}")


## 7. Docker Deployment

In [12]:
# Dockerfile

dockerfile = '''
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app.py .
COPY model/ ./model/

# Expose port
EXPOSE 8000

# Run
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
'''

print(dockerfile)





In [13]:
# requirements.txt

requirements = '''
fastapi==0.104.0
uvicorn==0.24.0
transformers==4.35.0
torch==2.1.0
pydantic==2.5.0
'''

print(requirements)
print("\n# Build: docker build -t nlp-api .")
print("# Run: docker run -p 8000:8000 nlp-api")




# Build: docker build -t nlp-api .
# Run: docker run -p 8000:8000 nlp-api


## 8. Deployment Patterns

### Common Patterns

| Pattern | Use Case |
|---------|----------|
| Sync API | Low latency, real-time |
| Async Queue | Batch processing |
| Serverless | Variable load |
| Model Server | Multi-model serving |
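The async queue pattern can be sketched with the standard library alone: the API enqueues a job and returns an ID immediately, while a background worker runs the model and stores the result. Everything here is illustrative. `fake_model`, `submit`, and the in-memory `results` dictionary are hypothetical stand-ins for a real model, a message broker, and a result store.

```python
import queue
import threading
import uuid

jobs: queue.Queue = queue.Queue()
results: dict = {}


def fake_model(text: str) -> dict:
    """Hypothetical stand-in for a slow model call."""
    return {"label": "POSITIVE" if "good" in text.lower() else "NEGATIVE"}


def submit(text: str) -> str:
    """Enqueue a job and return immediately with an ID the client can poll."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, text))
    return job_id


def worker() -> None:
    """Consume jobs until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, text = item
        results[job_id] = fake_model(text)
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_ids = [submit(t) for t in ["Good movie", "Terrible plot"]]
jobs.join()      # wait until every queued job has been processed
jobs.put(None)   # tell the worker to stop
thread.join()

for job_id in job_ids:
    print(job_id[:8], results[job_id])
```

In contrast to the sync API, the caller never waits on the model itself, which is why this pattern suits batch workloads where throughput matters more than per-request latency.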

### Scaling Considerations

1. **Horizontal**: Multiple replicas behind load balancer
2. **Batching**: Combine requests for GPU efficiency
3. **Caching**: Cache embeddings/predictions
4. **GPU Sharing**: Use NVIDIA Triton for multi-model
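Batching and caching from the list above can be illustrated in a few lines. This is a sketch under simplifying assumptions: `fake_model_batch` is a hypothetical model that scores a whole batch per call (here it just returns text lengths), and the cache is an in-process `functools.lru_cache` rather than a shared cache such as Redis.

```python
from functools import lru_cache
from typing import Iterator, List, Sequence

call_count = 0  # counts how often the (expensive) model is invoked


def fake_model_batch(texts: Sequence[str]) -> List[int]:
    """Hypothetical batched model: one call scores many texts."""
    global call_count
    call_count += 1
    return [len(t) for t in texts]


def make_batches(items: Sequence[str], max_batch_size: int) -> Iterator[Sequence[str]]:
    """Split items into consecutive chunks of at most max_batch_size."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


def predict_many(texts: Sequence[str], max_batch_size: int = 4) -> List[int]:
    """Run the model once per batch instead of once per text."""
    outputs: List[int] = []
    for batch in make_batches(texts, max_batch_size):
        outputs.extend(fake_model_batch(batch))
    return outputs


@lru_cache(maxsize=1024)
def predict_cached(text: str) -> int:
    """Repeated inputs are answered from the cache without a model call."""
    return fake_model_batch([text])[0]


texts = [f"request {i}" for i in range(10)]
predict_many(texts)
print("model calls for 10 texts:", call_count)  # 3 batches of sizes 4, 4, 2

predict_cached("hello")
predict_cached("hello")  # second call is a cache hit
print(predict_cached.cache_info())
```

Ten texts cost three model calls instead of ten, and the repeated `"hello"` request costs one. On a GPU the benefit of batching is usually larger than this call count suggests, because a batch is processed in parallel.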

## 9. Interview Questions

**Q1: How do you reduce model latency?**
<details><summary>Answer</summary>

1. Quantization (INT8/FP16)
2. ONNX Runtime / TensorRT
3. Distillation to smaller model
4. Batching requests
5. Caching frequent predictions
</details>
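Before applying any of these techniques, it helps to measure latency so that the effect of each change is visible. The helper below is a minimal sketch using the standard library; `measure_latency` and `fake_inference` are hypothetical names, and the dummy workload stands in for a real model call.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to fn and report p50/p95/mean latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "mean_ms": statistics.fmean(samples_ms),
    }


def fake_inference() -> int:
    """Hypothetical stand-in for a model forward pass."""
    return sum(i * i for i in range(10_000))


stats = measure_latency(fake_inference)
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting tail latency (p95) alongside the median matters in production, since a change that improves the average can still leave slow outliers untouched.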

**Q2: What's the difference between TorchScript trace vs script?**
<details><summary>Answer</summary>

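- `trace`: Runs the model on an example input and records the tensor operations executed. Data-dependent control flow (`if`, loops) is frozen to the path taken by that input.
- `script`: Compiles the Python source directly, so control flow is preserved, but only a restricted subset of Python is supported.
- Use trace for simple forward passes, script for models whose logic depends on the input.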
</details>

**Q3: How do you handle model updates in production?**
<details><summary>Answer</summary>

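1. Blue-green deployment: Run old and new versions side by side, then switch all traffic to the new one (with quick rollback)
2. Canary: Route a small % of traffic to the new model and increase it gradually
3. A/B testing: Split traffic between models and compare metrics
4. Shadow mode: Run the new model on live traffic without returning its predictions to users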
</details>
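A canary split is often implemented by hashing a stable identifier so that each user consistently sees the same model. The following is a minimal sketch of that idea; the function names, the `"old"`/`"new"` labels, and the choice of SHA-256 are assumptions for illustration, not a prescribed method.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_model(user_id: str, canary_percent: int) -> str:
    """Route roughly canary_percent% of users to the new model, deterministically."""
    return "new" if bucket(user_id) < canary_percent else "old"


users = [f"user-{i}" for i in range(10_000)]
share = sum(choose_model(u, canary_percent=10) == "new" for u in users) / len(users)
print(f"share routed to new model: {share:.1%}")
```

Raising `canary_percent` step by step widens the rollout without reshuffling users who were already on the new model, and setting it to 0 acts as an immediate rollback.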

## 10. Summary

- **FastAPI**: Production-ready REST APIs
- **Optimization**: Quantization, ONNX, TorchScript
- **Docker**: Containerized deployment
- **Patterns**: Sync API, async queue, serverless, model server

## 11. References

- [FastAPI Docs](https://fastapi.tiangolo.com/)
- [ONNX Runtime](https://onnxruntime.ai/)
- [TorchServe](https://pytorch.org/serve/)
- [Docker Best Practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/)

---
**Next:** [Module 23: NLP Evaluation & Monitoring](../23_evaluation/23_evaluation.ipynb)