# Local Model Serving with FastAPI

**Module 04 | Notebook 1 of 4**

Learn to create production-ready REST APIs for your ML models using FastAPI.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Create a FastAPI application for model serving
2. Implement prediction endpoints
3. Handle input validation with Pydantic
4. Test your API locally

---

In [14]:
%%capture
!pip install transformers torch fastapi uvicorn pydantic

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


---

## Why FastAPI?

### REST API Serving Pattern

```
┌─────────────┐     HTTP Request      ┌─────────────┐
│   Client    │─────────────────────→ │   FastAPI   │
│  (Browser,  │     {"text": "..."}   │   Server    │
│   Mobile)   │                       └──────┬──────┘
└─────────────┘                              │
       ↑                                     ▼
       │                              ┌─────────────┐
       │      HTTP Response           │    Model    │
       └───────────────────────────── │  Inference  │
             {"label": "POSITIVE"}    └─────────────┘
```

### FastAPI Advantages

| Feature | Benefit |
|---------|--------|
| **Automatic docs** | Swagger UI out of the box |
| **Type hints** | Automatic validation |
| **Async support** | High concurrency |
| **Fast** | Built on Starlette and Pydantic; among the faster Python web frameworks |
| **Modern** | Built on standard Python type hints and `async`/`await` |

---

## Load the Model

In [16]:
# Load model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()

print(f"Model loaded: {model_name}")
print(f"Labels: {model.config.id2label}")

Model loaded: distilbert-base-uncased-finetuned-sst-2-english
Labels: {0: 'NEGATIVE', 1: 'POSITIVE'}


In [17]:
# Test prediction function
def predict(text: str) -> dict:
    """Run inference on input text."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    ).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_idx = probs.argmax().item()
    
    return {
        "label": model.config.id2label[pred_idx],
        "confidence": probs[pred_idx].item(),
        "probabilities": {
            model.config.id2label[i]: probs[i].item() 
            for i in range(len(probs))
        }
    }

# Test
result = predict("This movie was fantastic!")
print(f"Test prediction: {result}")

Test prediction: {'label': 'POSITIVE', 'confidence': 0.9998781681060791, 'probabilities': {'NEGATIVE': 0.00012178818724351004, 'POSITIVE': 0.9998781681060791}}
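
To make the post-processing step concrete, here is a small pure-Python sketch of how raw logits turn into the `label`, `confidence`, and `probabilities` fields. The logit values and the `postprocess` helper are invented for illustration; the real `predict` function does the same thing with `torch.softmax` on the model's output.

```python
import math

# Label mapping as printed by model.config.id2label above
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits):
    """Hypothetical helper mirroring the dictionary returned by predict()."""
    probs = softmax(logits)
    pred_idx = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return {
        "label": ID2LABEL[pred_idx],
        "confidence": probs[pred_idx],
        "probabilities": {ID2LABEL[i]: p for i, p in enumerate(probs)},
    }

# Illustrative logits, not real model output
result = postprocess([-4.0, 5.0])
print(result["label"], round(result["confidence"], 4))
```

The `confidence` value is simply the probability of the winning class, so it always matches one entry in `probabilities`.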


---

## Create the FastAPI Application

Here's the complete FastAPI application code. The notebook keeps it in a string so that a later cell can write it to `app.py`; in a production setting, the code would live in that file directly.

In [None]:
# FastAPI application code
app_code = '''
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import Dict, List
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Initialize app
app = FastAPI(
    title="Sentiment Analysis API",
    description="A REST API for sentiment classification using DistilBERT",
    version="1.0.0"
)

# Load model at startup
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Request/Response schemas
class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000, description="Text to classify")
    
    # Pydantic v2 style; replaces the deprecated inner "class Config"
    model_config = {
        "json_schema_extra": {
            "example": {"text": "This movie was absolutely fantastic!"}
        }
    }

class PredictionResponse(BaseModel):
    label: str
    confidence: float
    probabilities: Dict[str, float]

class BatchRequest(BaseModel):
    texts: List[str] = Field(..., max_length=100)

class HealthResponse(BaseModel):
    status: str
    model: str
    device: str

# Endpoints
@app.get("/health", response_model=HealthResponse)
def health_check():
    """Check if the API is running and model is loaded."""
    return {
        "status": "healthy",
        "model": model_name,
        "device": str(device)
    }

@app.post("/predict", response_model=PredictionResponse)
def predict_sentiment(request: PredictionRequest):
    """Predict sentiment for a single text."""
    try:
        inputs = tokenizer(
            request.text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        ).to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        probs = torch.softmax(outputs.logits, dim=-1)[0]
        pred_idx = probs.argmax().item()
        
        return {
            "label": model.config.id2label[pred_idx],
            "confidence": probs[pred_idx].item(),
            "probabilities": {
                model.config.id2label[i]: probs[i].item() 
                for i in range(len(probs))
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/predict/batch", response_model=List[PredictionResponse])
def predict_batch(request: BatchRequest):
    """Predict sentiment for multiple texts."""
    results = []
    for text in request.texts:
        req = PredictionRequest(text=text)
        results.append(predict_sentiment(req))
    return results

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

# Display the code
print("FastAPI Application Code:")
print("=" * 60)
print(app_code)

In [22]:
# Save to file
with open("./app.py", "w") as f:
    f.write(app_code)

print("✅ Application saved to app.py")
print("\nTo run the server:")
print("  python app.py")
print("  OR")
print("  uvicorn app:app --reload --host 0.0.0.0 --port 8000")

✅ Application saved to app.py

To run the server:
  python app.py
  OR
  uvicorn app:app --reload --host 0.0.0.0 --port 8000


In [24]:
# !python app.py

---

## Understanding the Application

### Request/Response Models (Pydantic)

```python
class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
```

This provides:
- **Automatic validation** (text must be 1-5000 characters)
- **Documentation** (shown in Swagger UI)
- **Type hints** for IDE support
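
The validation behaviour can be tried without starting a server. The sketch below reuses the `PredictionRequest` schema from the application; `failed_fields` is a hypothetical helper added only to show which inputs Pydantic rejects.

```python
from pydantic import BaseModel, Field, ValidationError

class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)

def failed_fields(payload: dict) -> list:
    """Return the names of fields that fail validation (empty list if the payload is valid)."""
    try:
        PredictionRequest(**payload)
        return []
    except ValidationError as exc:
        # Each error carries a "loc" tuple naming the offending field
        return [".".join(str(part) for part in err["loc"]) for err in exc.errors()]

print(failed_fields({"text": "Great film"}))   # valid input
print(failed_fields({"text": ""}))             # too short
print(failed_fields({}))                       # field missing
print(failed_fields({"text": "x" * 5001}))     # too long
```

When the same schema is used as an endpoint parameter, FastAPI runs this validation before the handler is called and answers invalid requests with a 422 response.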

### Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/predict` | POST | Single text prediction |
| `/predict/batch` | POST | Batch predictions |
| `/docs` | GET | Swagger UI (automatic) |
| `/redoc` | GET | ReDoc UI (automatic) |

---

## Testing the API

Once the server is running, you can test it using `requests`:

In [26]:
# Example client code (run when server is active)
client_code = '''
import requests

BASE_URL = "http://localhost:8000"

# Health check
response = requests.get(f"{BASE_URL}/health")
print("Health:", response.json())

# Single prediction
response = requests.post(
    f"{BASE_URL}/predict",
    json={"text": "This movie was fantastic!"}
)
print("Prediction:", response.json())

# Batch prediction
response = requests.post(
    f"{BASE_URL}/predict/batch",
    json={
        "texts": [
            "I love this product!",
            "Terrible experience, never again.",
            "It was okay."
        ]
    }
)
print("Batch:", response.json())
'''

print("Client Test Code:")
print("=" * 60)
print(client_code)

# Save client code
with open("./test_client.py", "w") as f:
    f.write(client_code)
print("\n✅ Client code saved to test_client.py")

✅ Client code saved to test_client.py


---

## Running the Server

### ⚠️ Environment-Specific Instructions

The way you run the FastAPI server depends on your environment:

| Environment | How to Run Server | How to Test |
|-------------|-------------------|-------------|
| **Colab/Jupyter** | Background subprocess (see below) | Run code in next cell |
| **Local (2 terminals)** | `python app.py` | `python test_client.py` |
| **Production** | `uvicorn app:app --host 0.0.0.0 --port 8000` | HTTP client or curl |

---

### Running in Google Colab (Notebook Environment)

In Colab, we can't open a second terminal, so we start the server as a **background process** using `subprocess`. This allows the notebook to continue executing while the server runs in the background.

> **Note:** This is a workaround for learning/demo purposes. In production, you would run the server as a standalone process or container.

In [None]:
# ============================================================
# START SERVER IN BACKGROUND (Colab/Jupyter Only)
# ============================================================
# In a notebook environment, we can't run the server in a separate 
# terminal. Instead, we use subprocess to run it in the background.
#
# PRODUCTION ALTERNATIVE:
# - Open a terminal and run: python app.py
# - Or use: uvicorn app:app --reload --host 0.0.0.0 --port 8000
# - Or deploy with Docker/Kubernetes (see Module 04 notebooks)
# ============================================================

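import subprocess
import sys
import time

# Start the FastAPI server as a background process.
# Output goes to server.log: an unread subprocess.PIPE can fill up and block the server.
print("Starting FastAPI server in background...")
server_log = open("server.log", "w")
server_process = subprocess.Popen(
    [sys.executable, "app.py"],  # same interpreter as the notebook kernel
    stdout=server_log,
    stderr=subprocess.STDOUT
)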

# Wait for the server to start and model to load
# This takes ~10-20 seconds due to model initialization
print("Waiting for model to load (this may take 15-20 seconds)...")
time.sleep(20)

print("✅ Server should be running on http://localhost:8000")
print("   - API docs: http://localhost:8000/docs")
print("   - Health check: http://localhost:8000/health")

Server started!
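
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. One alternative, sketched here under the assumption that the `/health` endpoint answers with HTTP 200 once the model is loaded, is to poll until the server responds. The helper name `wait_for_server` and its default values are illustrative and not part of the original notebook.

```python
import time
import urllib.request

def wait_for_server(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, socket timeouts, and urllib's URLError/HTTPError
            pass
        time.sleep(interval)
    return False
```

In the notebook this would be called as `wait_for_server("http://localhost:8000/health")` in place of `time.sleep(20)`. When you are done testing, `server_process.terminate()` stops the background server.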


---

## Testing the API

Now that the server is running in the background, we can send HTTP requests to it directly from this notebook.

> **In Production:** You would typically test using:
> - `curl` commands from terminal
> - A separate test script (`python test_client.py`)
> - Automated tests with `pytest` and `httpx`
> - API testing tools like Postman or Insomnia

In [30]:
import requests

BASE_URL = "http://localhost:8000"

# Health check
response = requests.get(f"{BASE_URL}/health")
print("Health:", response.json())

# Prediction
response = requests.post(
    f"{BASE_URL}/predict",
    json={"text": "This movie was fantastic!"}
)
print("Prediction:", response.json())

Health: {'status': 'healthy', 'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'device': 'cuda'}
Prediction: {'label': 'POSITIVE', 'confidence': 0.9998781681060791, 'probabilities': {'NEGATIVE': 0.00012178818724351004, 'POSITIVE': 0.9998781681060791}}


In [None]:
# ============================================================
# TEST THE API
# ============================================================
# This code sends requests to our running FastAPI server.
# In production, this would be in a separate test_client.py file.
# ============================================================

import requests

BASE_URL = "http://localhost:8000"

# Health check
print("1. Health Check:")
print("-" * 40)
response = requests.get(f"{BASE_URL}/health")
print(response.json())

# Single prediction
print("\n2. Single Prediction:")
print("-" * 40)
response = requests.post(
    f"{BASE_URL}/predict",
    json={"text": "This movie was fantastic!"}
)
print(response.json())

# Batch prediction
print("\n3. Batch Prediction:")
print("-" * 40)
response = requests.post(
    f"{BASE_URL}/predict/batch",
    json={
        "texts": [
            "I love this product!",
            "Terrible experience, never again.",
            "It was okay."
        ]
    }
)
for i, result in enumerate(response.json()):
    print(f"  Text {i+1}: {result['label']} ({result['confidence']:.2%})")

Health: {'status': 'healthy', 'model': 'distilbert-base-uncased-finetuned-sst-2-english', 'device': 'cuda'}
Prediction: {'label': 'POSITIVE', 'confidence': 0.9998781681060791, 'probabilities': {'NEGATIVE': 0.00012178818724351004, 'POSITIVE': 0.9998781681060791}}
Batch: [{'label': 'POSITIVE', 'confidence': 0.9998855590820312, 'probabilities': {'NEGATIVE': 0.00011442836694186553, 'POSITIVE': 0.9998855590820312}}, {'label': 'NEGATIVE', 'confidence': 0.9902605414390564, 'probabilities': {'NEGATIVE': 0.9902605414390564, 'POSITIVE': 0.009739442728459835}}, {'label': 'POSITIVE', 'confidence': 0.9998270869255066, 'probabilities': {'NEGATIVE': 0.00017293228302150965, 'POSITIVE': 0.9998270869255066}}]


---

## Production Best Practices

### 1. Model Loading
- Load model ONCE at startup, not per request
- Use a `lifespan` handler for initialization (`@app.on_event("startup")` still works but is deprecated in recent FastAPI releases)

### 2. Error Handling
- Use try/except and HTTPException
- Return meaningful error messages

### 3. Logging
```python
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```

### 4. Rate Limiting
```python
from fastapi_limiter import FastAPILimiter
```

### 5. CORS (for web clients)
```python
...
```

### 7. Health Checks
- Include model status, memory usage, GPU utilization
- Kubernetes and Docker can call the health endpoint for readiness and health probes
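
As a rough sketch of a richer health payload, the helper below reports model status, uptime, and peak memory using only the standard library. The function names and payload keys are assumptions chosen for illustration, and the `resource` module is available on Unix-like systems only.

```python
import time

try:
    import resource  # Unix only
except ImportError:
    resource = None

START_TIME = time.monotonic()

def peak_memory():
    """Peak resident set size as reported by the OS (kilobytes on Linux, bytes on macOS), or None."""
    if resource is None:
        return None
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def build_health_payload(model_loaded: bool, device: str) -> dict:
    """Hypothetical helper that assembles the body of a /health response."""
    return {
        "status": "healthy" if model_loaded else "loading",
        "model_loaded": model_loaded,
        "device": device,
        "uptime_seconds": round(time.monotonic() - START_TIME, 1),
        "peak_memory": peak_memory(),
    }

print(build_health_payload(model_loaded=True, device="cpu"))
```

On a GPU machine, a value such as `torch.cuda.memory_allocated()` could be added to the payload. For a readiness probe, the endpoint would typically return a non-200 status while `model_loaded` is false.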

---

## 🎯 Student Challenge

### Challenge: Add New Endpoints

In [None]:
# TODO: Extend the API with these features:

# 1. Add a `/tokenize` endpoint that returns token information
#    - Input: {"text": "..."}
#    - Output: {"tokens": [...], "token_ids": [...], "num_tokens": N}

# 2. Add model info endpoint `/model/info`
#    - Output: {"name": "...", "parameters": N, "vocab_size": N}

# 3. Add request timing middleware
#    - Log request duration for each call

# Your solution:


---

## 📝 Key Takeaways

1. **FastAPI** provides automatic docs, validation, and async support
2. **Pydantic models** define request/response schemas with validation
3. **Load models once** at startup for efficiency
4. **Health endpoints** are essential for production monitoring
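5. **Batch endpoints** save HTTP round trips; for larger throughput gains, pass the texts through the tokenizer and model as one padded batch rather than looping over them as this notebook's endpoint does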

---

## ➡️ Next Steps

Continue to `02_gradio_ui.ipynb` for interactive web interfaces!