<a href="https://colab.research.google.com/github/raminass/AI-Wrokshop/blob/main/workshop_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building and Deploying AI Microservices: A Practical Guide

This tutorial will guide you through building a practical AI system using a microservices architecture. Rather than covering theoretical concepts alone, we'll build a real-world AI application step-by-step. By the end, you'll have created a fully functional AI system with multiple services working together.

We'll build a text analysis platform with three core services:
1. **Sentiment Analysis Service** (using HuggingFace)
2. **Text Generation Service** (using OpenAI API)
3. **API Gateway** (to coordinate requests)

This architecture demonstrates key microservices principles while creating something useful.

## Table of Contents

1. [Introduction to AI Microservices](#intro)
2. [Building AI Services with Python](#building)
3. [Containerizing AI Applications with Docker](#docker)
4. [Local Model Integration (HuggingFace)](#huggingface)
5. [External API Integration (OpenAI)](#openai)
6. [Deployment with Docker Compose](#deployment)
7. [Monitoring and Observability](#monitoring)
8. [Performance Optimization](#performance)


<a id="intro"></a>
## 1. Introduction to AI Microservices

**Monolithic Architecture**
- Single, unified codebase
- All components deployed together
- Shared database
- Simple to develop initially
- Challenges with scaling and maintenance

**Microservices Architecture**
- Collection of small, independent services
- Each service focused on specific business capability
- Decentralized data management
- Independent deployment
- Technology diversity


**For AI applications, typical services include:**
- API Gateway (entry point for all clients)
- Model Serving Service
- Feature Processing Service
- Data Storage Service
- Monitoring Service
- More ...

### Communication Patterns

Two main approaches:

**Synchronous (REST/HTTP):**
```
Client → Request → Server → Response → Client
```

**Asynchronous (Message Queue):**
```
Producer → Message → Queue → Message → Consumer
```
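
To make the difference concrete, here is a minimal in-process sketch using only the standard library. The `analyze` function is a hypothetical stand-in for a real service call, and `queue.Queue` stands in for a real message broker; neither is part of the platform built later.

```python
import queue
import threading

def analyze(text: str) -> str:
    """Hypothetical stand-in for a real service call."""
    return "positive" if "love" in text.lower() else "negative"

# Synchronous: the caller blocks until the result comes back.
sync_result = analyze("I love this product")

# Asynchronous: the producer enqueues messages and moves on;
# a consumer thread picks them up when it is ready.
messages: queue.Queue = queue.Queue()
results: list = []

def consumer() -> None:
    while True:
        text = messages.get()
        if text is None:  # sentinel value: no more work
            break
        results.append((text, analyze(text)))

worker = threading.Thread(target=consumer)
worker.start()
for text in ["I love this product", "This is terrible"]:
    messages.put(text)  # returns immediately, without waiting for a result
messages.put(None)
worker.join()
```

In the synchronous case the caller has its answer as soon as the call returns; in the asynchronous case the producer never sees the result directly, so results have to be delivered through some other channel (here, the shared `results` list).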


#### API Gateway Pattern
- Single entry point for all clients
- Routing, composition, protocol translation
- Security, monitoring, rate limiting
- Reduces client complexity


![Communication Patterns Diagram](https://qentelli.com/sites/default/files/inline-images/API-gateway-pattern.png)

### Real-World Microservices Examples in AI

1. **Netflix**: Recommendation engine as a service
2. **Uber**: ML-based ETA prediction and surge pricing
3. **Spotify**: Music recommendation engine
4. **Amazon**: Product recommendation services
5. **OpenAI**: API services for various AI models

<a id="building"></a>
## 2. Building AI Services with Python
### Sample Flask API Structure

In [None]:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({"status": "healthy"}), 200

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json

    if not data or 'text' not in data:
        return jsonify({"error": "No text provided"}), 400

    # Your ML logic here
    result = {"prediction": "positive", "confidence": 0.92}

    return jsonify(result)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)

### Alternative: FastAPI for Higher Performance


In [None]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class TextInput(BaseModel):
    text: str

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/predict")
async def predict(input_data: TextInput):
    # Your ML logic here
    result = {"prediction": "positive", "confidence": 0.92}
    return result

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

### API Design Best Practices

1. **API First Development**: Design APIs before implementation
2. **Versioning**: Plan for evolution and backward compatibility
3. **Documentation**: OpenAPI (Swagger) specifications
4. **Security**: Authentication, authorization, encryption
5. **Idempotency**: Safe retries for failed requests
6. **Rate Limiting**: Protect services from overload
7. **Status Codes**: Proper use of HTTP status codes
8. **Pagination**: Efficient handling of large collections
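
As an illustration of item 6, here is a minimal token-bucket rate limiter. It is a sketch rather than a production implementation: the class name and parameters are made up for this example, and a real deployment would usually keep the bucket state in a shared store so that all replicas see the same limits.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would typically respond with HTTP 429
```

A request handler would call `allow()` first and return `429 Too Many Requests` whenever it yields `False`.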

<a id="docker"></a>
## 3. Containerizing AI Applications with Docker

Docker provides consistent environments from development to production.
### Installation Steps

#### macOS Installation
```bash
# Install Docker Desktop for Mac using Homebrew
brew update
brew install --cask docker

# Start Docker Desktop from Applications folder
# Or run the command:
open -a Docker
```
#### Windows Installation
```bash
# Download Docker Desktop for Windows from:
# https://www.docker.com/products/docker-desktop/

# Follow the installation wizard
# Enable WSL 2 if prompted
```

### Verify installation
```bash
# Verify installation
docker --version
docker run hello-world
```

If successful, you'll see a message indicating that your Docker installation is working correctly.


![Docker Flow](https://ucarecdn.com/2f29c783-13a6-4370-8260-0116340242e5/)

### Project Structure

```bash
text-analysis-platform/
├── sentiment-service/
│   ├── app.py
│   ├── Dockerfile
│   ├── requirements.txt
```


### Basic Dockerfile
```bash
FROM python:3.9-slim
WORKDIR /app
# Install required system dependencies for PyTorch and transformers
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements file first for better caching
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose the port the app runs on
EXPOSE 5005
# Command to run the application
CMD ["python", "app.py"]
```


### requirements.txt

```bash
flask==2.0.1
transformers==4.21.0
torch==1.13.1
```

### Building and Running the Sentiment Analysis Container

```bash
# Build an image
docker build -t sentiment-analyzer .
# Run a container
docker run -d -p 5005:5005 --name sentiment-api sentiment-analyzer
# Check if the container is running
docker ps
# View logs from the container
docker logs sentiment-api
# Test the API using curl
curl -X POST -H "Content-Type: application/json" -d '{"text":"I love this product!"}' http://localhost:5005/analyze
# Stop and remove the container
docker stop sentiment-api
docker rm sentiment-api
```
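
The `curl` test above can also be scripted, which is handy for automated smoke tests. The sketch below uses only the standard library; the function names are illustrative, and the default URL assumes the container is published on port 5005 as in the `docker run` command.

```python
import json
import urllib.request

def build_analyze_request(text: str, base_url: str = "http://localhost:5005") -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze(text: str, base_url: str = "http://localhost:5005", timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (requires a running container)."""
    request = build_analyze_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the container running, `analyze("I love this product!")` should return the service's JSON response as a dictionary.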

### Components of Production AI Systems

1. **Data Collection & Storage**
   - Data ingestion pipelines
   - Data lakes and warehouses
   - Feature stores

2. **Model Development**
   - Experimentation environments
   - Version control for ML models
   - Reproducible builds

3. **Model Serving**
   - Inference services
   - API design for ML endpoints
   - Batch vs real-time processing

4. **Model Monitoring**
   - Performance tracking
   - Drift detection
   - Alerting systems

5. **Feedback Loops**
   - User feedback collection
   - A/B testing infrastructure
   - Model retraining pipelines

### Model Serving Architectures

#### Model-as-a-Service
- Direct API calls to model
- Dedicated inference servers
- Examples: TensorFlow Serving, Triton Inference Server

#### Embedded Models
- Models deployed within application services
- Lower latency, higher coupling
- Suitable for smaller models or edge deployment

#### Batch Inference
- Scheduled batch processing of data
- Higher throughput, not real-time
- Cost-efficient for non-time-sensitive applications

#### Hybrid Approaches
- Pre-computed results with real-time fallback
- Cached inferences with live updates
- Best of both worlds for certain applications
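
The first hybrid variant, pre-computed results with a real-time fallback, can be sketched in a few lines. The names here are hypothetical; `precomputed` stands in for whatever store a batch job writes to, and `live_model` for a call to an inference service.

```python
from typing import Callable, Dict

def make_hybrid_predictor(precomputed: Dict[str, str], live_model: Callable[[str], str]):
    """Serve pre-computed results when available, otherwise fall back to the live model."""
    def predict(key: str) -> str:
        if key in precomputed:
            return precomputed[key]   # fast path: result produced by a batch job
        result = live_model(key)      # slow path: real-time inference
        precomputed[key] = result     # remember it for the next request
        return result
    return predict
```

The trade-off is freshness: pre-computed entries stay in place until the batch job overwrites them, so this fits inputs whose predictions change slowly.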

<a id="huggingface"></a>
## 4. Local Model Integration (HuggingFace)

HuggingFace provides easy access to thousands of pre-trained models.
### Setting Up a Sentiment Analysis Service

```bash
pip install transformers torch flask
```

In [None]:
# Create a simple Flask app with HuggingFace integration
#sentiment_analyzer.py
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import time

app = Flask(__name__)

print("Loading sentiment analysis model...")
start_time = time.time()

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds!")

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({"status": "healthy"}), 200

@app.route('/analyze', methods=['POST'])
def analyze_text():
    data = request.json
    if not data or 'text' not in data:
        return jsonify({"error": "No text provided"}), 400

    text = data['text']

    # Perform sentiment analysis using HuggingFace
    result = sentiment_analyzer(text)

    # Format and return the result
    sentiment = {
        "label": result[0]['label'],
        "score": float(result[0]['score'])
    }

    return jsonify({"sentiment": sentiment})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

### Dockerfile for HuggingFace Service

```bash
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Create directory for pre-downloaded models
RUN mkdir -p /app/models
# Script to download model at build time
COPY download_model.py .
RUN python download_model.py
# Copy application code
COPY . .
# Expose port
EXPOSE 8000
# Run the application
CMD ["python", "sentiment_analyzer.py"]
```

### Model Download Script

This script pre-downloads the model during the Docker build, so the weights are baked into the image. Without it, the model is downloaded when the container first starts, which slows startup and requires network access at runtime.

In [None]:
# download_model.py
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import os

# Specify the model you want to use (same as in app.py)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
cache_dir = "/app/models"

# Create cache directory if it doesn't exist
os.makedirs(cache_dir, exist_ok=True)

print(f"Downloading model: {model_name}")

# Download and cache the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir=cache_dir)

# Save model info to verify it's ready
with open(os.path.join(cache_dir, "model_info.txt"), "w") as f:
    f.write(f"Model: {model_name}\n")
    f.write(f"Tokenizer vocabulary size: {len(tokenizer)}\n")
    f.write(f"Model parameters: {model.num_parameters()}\n")

print("Model downloaded successfully!")

To use this pre-download approach, update your Dockerfile like this:

```bash
FROM python:3.9-slim
WORKDIR /app
# Install required system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements file first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Create directory for pre-downloaded models
RUN mkdir -p /app/models
# Copy and run model download script BEFORE copying all app code
# This creates a separate layer that won't be rebuilt unless the script changes
COPY download_model.py .
RUN python download_model.py
# Now copy the rest of the application
COPY . .
# Update app.py to use the cached model (if needed)
# You would modify app.py to look for models in /app/models
EXPOSE 5005
CMD ["python", "app.py"]
```
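
The Dockerfile comments mention pointing the app at `/app/models`. A minimal sketch of that change is shown below, assuming the same model name and cache directory as `download_model.py`; the function name `load_sentiment_pipeline` is illustrative. Passing `local_files_only=True` makes loading fail loudly instead of silently downloading if the cache is missing.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
CACHE_DIR = "/app/models"  # must match cache_dir in download_model.py

def load_sentiment_pipeline(model_name: str = MODEL_NAME, cache_dir: str = CACHE_DIR):
    """Build the sentiment pipeline from the files cached at image build time."""
    # Imported here so this module can be imported without loading transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(
        model_name, cache_dir=cache_dir, local_files_only=True
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, cache_dir=cache_dir, local_files_only=True
    )
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
```

In the service, `sentiment_analyzer = load_sentiment_pipeline()` would replace the three loading statements at module level.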

### Additional Optimizations for Production

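1. **Reduced Precision (FP16)**: Convert the model weights to half precision for a smaller memory footprint and, on GPUs, faster inference. Strictly speaking this is a precision reduction rather than quantization (which usually refers to INT8 or lower), and on CPU it may not be faster.
```python
# After loading the model
model = model.half()  # Convert weights to FP16
```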

2. **Batch Processing**: Add a batch endpoint for processing multiple texts at once

In [None]:
@app.route('/analyze-batch', methods=['POST'])
def analyze_batch():
    data = request.json
    if not data or 'texts' not in data:
        return jsonify({"error": "No texts provided"}), 400

    texts = data['texts']
    results = sentiment_analyzer(texts)

    formatted_results = []
    for i, result in enumerate(results):
        formatted_results.append({
            "text": texts[i],
            "label": result['label'],
            "score": float(result['score'])
        })

    return jsonify({"results": formatted_results})

3. **Turn off debug mode in production**:

In [None]:
if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=5005)

<a id="openai"></a>
## 5. External API Integration (OpenAI)

Using external AI APIs can be simpler than hosting your own models.

In [None]:
# openai_service.py
from flask import Flask, request, jsonify
import os
import openai

app = Flask(__name__)

# Initialize OpenAI client
openai.api_key = os.environ.get("OPENAI_API_KEY")

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({"status": "healthy"}), 200

@app.route('/generate', methods=['POST'])
def generate_text():
    data = request.json
    if not data or 'prompt' not in data:
        return jsonify({"error": "No prompt provided"}), 400

    prompt = data['prompt']
    max_tokens = data.get('max_tokens', 100)
    temperature = data.get('temperature', 0.7)

    try:
        # Call the OpenAI API (legacy Completion interface; requires openai<1.0)
        completion = openai.Completion.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature
        )

        # Return the generated text
        return jsonify({
            "generated_text": completion.choices[0].text.strip(),
            "usage": completion.usage
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
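
The service above forwards `max_tokens` and `temperature` straight from the client to a paid API. A small validation step can guard against accidental or abusive values. The sketch below is one way to do it; the function name and the cap of 500 tokens are assumptions, while the temperature range of 0 to 2 follows the OpenAI API.

```python
def parse_generation_params(data: dict, max_tokens_cap: int = 500) -> dict:
    """Validate a /generate request body and return cleaned parameters."""
    prompt = data.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")

    max_tokens = data.get("max_tokens", 100)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("max_tokens must be an integer")
    if not 1 <= max_tokens <= max_tokens_cap:
        raise ValueError(f"max_tokens must be between 1 and {max_tokens_cap}")

    temperature = data.get("temperature", 0.7)
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        raise ValueError("temperature must be a number")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": float(temperature)}
```

In the route handler, a `ValueError` from this function would map to a `400` response rather than the generic `500` used for upstream failures.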


### Security Best Practices for API Keys

Never include API keys in your code or Dockerfile. Use environment variables instead. For quick local experiments you can set the variable inside your session, but never commit a real key:

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"  # For testing only
# In production: Pass as environment variable in Docker
# docker run -e OPENAI_API_KEY=your-key-here openai-service
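
On the service side, it helps to fail fast when the key is missing and to avoid printing it in full. The helpers below are a sketch; `require_env` and `mask` are illustrative names, not part of any library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `require_env("OPENAI_API_KEY")` at startup surfaces a missing key immediately, instead of on the first user request.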

<a id="deployment"></a>
## 6. Deployment with Docker Compose

Docker Compose simplifies multi-container deployments.

### Simple Docker Compose File

```bash
#docker-compose.yml
version: '3'
services:
  frontend:
    build: ./frontend-service
    ports:
      - "3000:3000"
    depends_on:
      - api-gateway
    networks:
      - app-network

  api-gateway:
    build: ./api-gateway
    ports:
      - "8000:8000"
    depends_on:
      - sentiment-service
      - openai-service
    networks:
      - app-network
    environment:
      - SENTIMENT_SERVICE_URL=http://sentiment-service:8001
      - OPENAI_SERVICE_URL=http://openai-service:8002

  sentiment-service:
    build: ./sentiment-service
    volumes:
      - ./models:/app/models
    networks:
      - app-network

  openai-service:
    build: ./openai-service
    networks:
      - app-network
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}

  mongodb:
    image: mongo:latest
    volumes:
      - mongo-data:/data/db
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  mongo-data:
```
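
One detail worth knowing: `depends_on` in this short form only controls start order; it does not wait for a service to be ready to accept requests. A model-serving container can take a while to load its weights, so a caller such as the gateway may want to poll the `/health` endpoint first. The helper below is a sketch with illustrative names and default timings.

```python
import time
import urllib.request
from typing import Callable

def http_health_check(url: str) -> Callable[[], bool]:
    """Return a check function that is True when `url` answers with HTTP 200."""
    def check() -> bool:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    return check

def wait_until_healthy(check: Callable[[], bool], timeout: float = 30.0,
                       interval: float = 0.5, sleep=time.sleep,
                       clock=time.monotonic) -> bool:
    """Poll `check` until it succeeds or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # connection refused or timed out: the service is not up yet
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, the gateway could call `wait_until_healthy(http_health_check("http://sentiment-service:8001/health"))` before it starts routing traffic, assuming the sentiment service listens on port 8001 as the Compose environment variables suggest.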

### Running the Application

```bash
# Start all services
docker-compose up -d
# View running containers
docker-compose ps
# View logs from a specific service
docker-compose logs sentiment-service
# Stop all services
docker-compose down
```


<a id="monitoring"></a>
## 7. Monitoring and Observability

### Adding Prometheus and Grafana to Docker Compose

```bash
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - app-network

  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    networks:
      - app-network
```


### Basic Prometheus Configuration
```bash
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'sentiment-service'
    static_configs:
      - targets: ['sentiment-service:8001']
  
  - job_name: 'api-gateway'
    static_configs:
      - targets: ['api-gateway:8000']
```
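
For these scrape jobs to work, each target has to serve metrics in the Prometheus text format, by default at `/metrics`. In practice you would normally use the official `prometheus_client` package; the dependency-free sketch below only illustrates what the scraped output looks like. The class and metric names are assumptions.

```python
from collections import Counter

class RequestMetrics:
    """Count requests per endpoint and status, and render them for Prometheus."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, endpoint: str, status: int) -> None:
        self._counts[(endpoint, status)] += 1

    def render(self) -> str:
        """Return the counters in the Prometheus text exposition format."""
        lines = [
            "# HELP http_requests_total Total HTTP requests handled.",
            "# TYPE http_requests_total counter",
        ]
        for (endpoint, status), count in sorted(self._counts.items()):
            lines.append(
                f'http_requests_total{{endpoint="{endpoint}",status="{status}"}} {count}'
            )
        return "\n".join(lines) + "\n"
```

A service would call `record()` after each request and return `render()` from a `/metrics` route with a plain-text content type.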

### Example structure

```bash
text-analysis-platform/
├── sentiment-service/
├── openai-service/
├── api-gateway/
├── frontend/
│   ├── app.py
│   ├── templates/
│   │   └── index.html
│   ├── Dockerfile
│   └── requirements.txt
```

<a id="performance"></a>
## 8. Performance Optimization

### Model Optimization Techniques

In [None]:
# 1. Use smaller models when possible
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # Distilled version

# 2. Implement request batching
texts = ["I love this product", "This is terrible", "Not bad at all"]
results = sentiment_analyzer(texts)  # More efficient than individual calls

# 3. Reduce precision to FP16 (half precision); mainly beneficial on GPU
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = model.half()  # Convert weights to FP16 for faster inference

### Request Batching Implementation

In [None]:
@app.route('/analyze-batch', methods=['POST'])
def analyze_batch():
    data = request.json
    if not data or 'texts' not in data:
        return jsonify({"error": "No texts provided"}), 400

    texts = data['texts']
    if not isinstance(texts, list) or not 1 <= len(texts) <= 100:  # Limit batch size
        return jsonify({"error": "Invalid batch size (1-100)"}), 400

    # Process batch
    results = sentiment_analyzer(texts)

    # Format results
    formatted_results = []
    for i, result in enumerate(results):
        formatted_results.append({
            "text": texts[i],
            "label": result['label'],
            "score": float(result['score'])
        })

    return jsonify({"results": formatted_results})
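
Because the endpoint rejects batches larger than 100, a client with more texts has to split them first. A small helper like the following would do; the name `chunked` is illustrative, and the default size mirrors the limit above.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

The client then sends one `/analyze-batch` request per chunk and concatenates the `results` lists in order.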

### Implementing Result Caching

In [None]:
from functools import lru_cache

@lru_cache(maxsize=1000)  # Cache up to 1000 results
def get_sentiment(text):
    return sentiment_analyzer(text)[0]

@app.route('/analyze-cached', methods=['POST'])
def analyze_text_cached():
    data = request.json
    if not data or 'text' not in data:
        return jsonify({"error": "No text provided"}), 400

    text = data['text']

    # Get result (will use cache if available)
    result = get_sentiment(text)

    sentiment = {
        "label": result['label'],
        "score": float(result['score'])
    }

    return jsonify({"sentiment": sentiment})
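
To see the cache at work without loading a model, the sketch below swaps in a hypothetical `fake_analyzer` that counts how often it is called, then inspects `cache_info()`, which `lru_cache` provides on every decorated function.

```python
from functools import lru_cache

model_calls = {"count": 0}

def fake_analyzer(text: str) -> list:
    """Hypothetical stand-in for the HuggingFace pipeline."""
    model_calls["count"] += 1
    label = "POSITIVE" if "love" in text else "NEGATIVE"
    return [{"label": label, "score": 0.99}]

@lru_cache(maxsize=1000)
def get_sentiment(text: str) -> dict:
    return fake_analyzer(text)[0]

get_sentiment("I love this product")
get_sentiment("I love this product")  # served from the cache
get_sentiment("This is terrible")

info = get_sentiment.cache_info()  # hits, misses, maxsize, currsize
```

Two caveats apply: the cache lives in a single process, so each worker or replica warms up separately, and the cached dictionary is returned by reference, so callers should not mutate it.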


## Conclusion

This notebook has covered the practical aspects of building, containerizing, and deploying AI microservices. The techniques shown here can be adapted and expanded for a wide variety of AI applications.

### Useful Links

#### 1. Docker and Containerization
* Tutorial: [Docker for Beginners](https://www.youtube.com/watch?v=fqMOX6JJhGo) by FreeCodeCamp

#### 2. Kubernetes for Orchestration
* Course: [Kubernetes Essentials](https://www.udacity.com/course/kubernetes-essentials--ud615) on Udacity

#### 3. Microservices Architecture
* Video Series: [Microservices Architecture](https://www.youtube.com/playlist?list=PLkQkbY7JNJuDqCFncFdTzGm6cRYCF-kZO) by TECH DOSE

#### 4. CI/CD Pipelines
* Tutorial: [GitHub Actions Tutorial](https://www.youtube.com/watch?v=R8_veQiYBjI) by TechWorld with Nana

#### 5. API Design and Development
* Course: [API Development with Flask](https://www.codecademy.com/learn/paths/create-rest-apis-with-flask-and-python) on Codecademy

#### 6. Machine Learning Model Deployment
* Guide: [Deploying ML Models](https://www.tensorflow.org/tfx/guide/serving) with TensorFlow Serving
* Tutorial: [Machine Learning Model deployment with FastAPI, Streamlit and Docker](https://medium.com/latinxinai/fastapi-and-streamlit-app-with-docker-compose-e4d18d78d61d)

#### 7. Model Monitoring
* [Weights & Biases Documentation](https://docs.wandb.ai/) - Widely used tool for experiment tracking and model monitoring
* Tutorial: [Model Monitoring with W&B](https://wandb.ai/site/model-monitoring) - Learn how to track model performance in production
* Guide: [Integrating W&B with ML Services](https://docs.wandb.ai/guides/integrations) - Step-by-step integration examples
