roy6n23/ServeStack
ML Inference Platform

A production-ready, high-performance machine learning inference platform built for learning and experimentation. This project demonstrates best practices for deploying ML models at scale with intelligent routing, caching, and comprehensive monitoring.

Perfect for beginners who want to understand how real-world ML inference systems work.

Features

  • Multi-Model Serving - Deploy ResNet18, ResNet50, and EfficientNet simultaneously via NVIDIA Triton Inference Server
  • Intelligent Routing - Automatically select models based on complexity requirements (speed vs accuracy trade-off)
  • Redis Caching - Avoid redundant inference calls with content-based caching
  • Prometheus Metrics - Full observability with request rates, latencies, cache hit rates, and error tracking
  • Grafana Dashboard - Pre-configured visualizations, ready out-of-the-box
  • Load Testing - Built-in Locust scripts for performance benchmarking

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Client Request                                  │
│                            (Image + Complexity)                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           FastAPI Gateway (:8080)                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │   /predict  │  │   /health   │  │   /models   │  │      /metrics       │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘ │
│         │                                                      │             │
│         ▼                                                      │             │
│  ┌─────────────────────────────────────────────┐               │             │
│  │          Intelligent Model Router           │               │             │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────────┐  │               │             │
│  │  │ simple  │  │ medium  │  │   complex   │  │               │             │
│  │  │ResNet18 │  │EffNet-B0│  │  ResNet50   │  │               │             │
│  │  │ (fast)  │  │(balanced)│  │ (accurate) │  │               │             │
│  │  └─────────┘  └─────────┘  └─────────────┘  │               │             │
│  └─────────────────────────────────────────────┘               │             │
│         │                                                      │             │
└─────────┼──────────────────────────────────────────────────────┼─────────────┘
          │                                                      │
          ▼                                                      ▼
┌───────────────────┐                                 ┌───────────────────────┐
│   Redis Cache     │                                 │     Prometheus        │
│     (:6379)       │                                 │       (:9090)         │
│                   │                                 │                       │
│  ┌─────────────┐  │                                 │  • Request Count      │
│  │ Hash-based  │  │                                 │  • Latency Histogram  │
│  │   Caching   │  │                                 │  • Cache Hit Rate     │
│  │  (1h TTL)   │  │                                 │  • Error Rate         │
│  └─────────────┘  │                                 │  • Active Requests    │
└───────────────────┘                                 └───────────┬───────────┘
                                                                  │
          │                                                       ▼
          ▼                                           ┌───────────────────────┐
┌─────────────────────────────────────────────┐       │       Grafana         │
│         Triton Inference Server             │       │       (:3000)         │
│              (:8000/8001/8002)              │       │                       │
│                                             │       │  ┌─────────────────┐  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐  │       │  │   ML Gateway    │  │
│  │ ResNet18  │ │ ResNet50  │ │EffNet-B0  │  │       │  │   Dashboard     │  │
│  │  model.pt │ │  model.pt │ │  model.pt │  │       │  └─────────────────┘  │
│  └───────────┘ └───────────┘ └───────────┘  │       │                       │
│                                             │       │  Pre-configured with  │
│         PyTorch TorchScript Models          │       │  9 visualization      │
│                                             │       │  panels               │
└─────────────────────────────────────────────┘       └───────────────────────┘

Project Structure

Serving-ResNet/
├── docker-compose.yml          # One-command infrastructure setup
├── requirements.txt            # Python dependencies
├── prometheus.yml              # Prometheus scrape configuration
├── cat.jpg                     # Sample test image
│
├── models/
│   ├── models.py               # Export PyTorch models to TorchScript
│   └── model_repository/       # Triton model repository
│       ├── resnet18/
│       │   ├── config.pbtxt    # Model configuration
│       │   └── 1/model.pt      # TorchScript model (generated)
│       ├── resnet50/
│       │   ├── config.pbtxt
│       │   └── 1/model.pt
│       └── efficientnet/
│           ├── config.pbtxt
│           └── 1/model.pt
│
├── gateway/
│   ├── gateway.py              # FastAPI inference gateway
│   └── load_test.py            # Locust load testing script
│
├── benchmark/
│   └── benchmark.py            # Direct Triton benchmarking tool
│
└── grafana/
    ├── provisioning/
    │   ├── datasources/        # Auto-configure Prometheus
    │   └── dashboards/         # Auto-load dashboards
    └── dashboards/
        └── ml-gateway.json     # Pre-built monitoring dashboard

Quick Start

Prerequisites

  • Python 3.9+
  • Docker & Docker Compose
  • ~10GB disk space (for Triton image)

1. Clone and Setup

git clone https://github.com/yourusername/Serving-ResNet.git
cd Serving-ResNet

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Export Models

cd models
python models.py
cd ..

This downloads pretrained weights and exports ResNet18, ResNet50, and EfficientNet-B0 to TorchScript format.

3. Start Infrastructure

docker compose up -d

This launches:

Service     Port              Description
Triton      8000, 8001, 8002  Model inference server
Redis       6379              Caching layer
Prometheus  9090              Metrics collection
Grafana     3000              Visualization (admin/admin)

4. Start Gateway

python gateway/gateway.py

Gateway runs on http://localhost:8080

5. Test It!

# Single prediction
curl -X POST http://localhost:8080/predict \
  -F "file=@cat.jpg" \
  -F "complexity=simple"

# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/models

# View metrics
curl http://localhost:8080/metrics
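If you prefer Python over curl, the /predict call can be made with the standard library alone. This is a hedged sketch: the endpoint and form-field names come from the API reference below, but the helper itself (build_multipart, predict) is illustrative, not part of the repository.

```python
# Minimal stdlib-only client for the gateway's /predict endpoint.
# Assumes the gateway from step 4 is listening on localhost:8080.
import json
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Encode plain form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              "Content-Type: application/octet-stream", ""]
    body = ("\r\n".join(lines).encode() + b"\r\n" + file_bytes
            + f"\r\n--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"


def predict(image_path: str, complexity: str = "medium") -> dict:
    with open(image_path, "rb") as f:
        data = f.read()
    body, content_type = build_multipart(
        {"complexity": complexity}, "file", image_path, data)
    req = urllib.request.Request(
        "http://localhost:8080/predict", data=body,
        headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    print(predict("cat.jpg", complexity="simple"))
```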

API Reference

POST /predict

Classify an image using the selected model.

Parameters:

Name        Type    Default   Description
file        File    required  Image file (JPEG, PNG)
complexity  string  "medium"  Model selection: simple, medium, complex
user_cache  bool    true      Enable/disable caching

Model Selection:

Complexity  Model            Speed     Accuracy
simple      ResNet18         Fastest   Good
medium      EfficientNet-B0  Balanced  Better
complex     ResNet50         Slower    Best

Response:

{
  "model_used": "resnet18",
  "top_prediction": 285,
  "confidence": 0.892,
  "top5_classes": [285, 281, 282, 287, 286],
  "inference_latency_ms": 45.2,
  "total_latency_ms": 52.1,
  "cache_hit": false
}

GET /health

Check service health.

{
  "status": "healthy",
  "triton": "ok",
  "redis": "ok"
}

GET /models

List available models.

{
  "models": ["resnet18", "resnet50", "efficientnet"],
  "count": 3
}

GET /metrics

Prometheus metrics endpoint.

Performance Benchmarking

Direct Triton Benchmark

Test raw Triton inference performance:

cd benchmark
python benchmark.py

Sample output:

ResNet18:      P50: 12ms  P95: 18ms  Throughput: 85 req/s
ResNet50:      P50: 35ms  P95: 52ms  Throughput: 28 req/s
EfficientNet:  P50: 25ms  P95: 38ms  Throughput: 40 req/s

Gateway Load Test

Test end-to-end performance with caching:

cd gateway
locust -f load_test.py --host=http://localhost:8080 --headless -u 50 -r 5 -t 60s

Parameters:

  • -u 50: 50 concurrent users
  • -r 5: Spawn 5 users per second
  • -t 60s: Run for 60 seconds

Monitoring

Grafana Dashboard

Access Grafana at http://localhost:3000 (admin/admin).

The pre-configured dashboard includes:

Panel                       Description
Request Rate                Requests per second
Active Requests             Current in-flight requests
Cache Hit Rate %            Percentage of cached responses
Error Rate                  Errors per second
Gateway Latency             P50/P95/P99 latency
Inference Latency by Model  Per-model Triton latency
Cache Hits vs Misses        Cache performance over time
Errors by Type              Error breakdown

Prometheus Metrics

Metric                            Type       Description
gateway_requests_total            Counter    Total requests by endpoint and status
gateway_request_latency_seconds   Histogram  End-to-end latency
gateway_active_requests           Gauge      Current concurrent requests
triton_inference_latency_seconds  Histogram  Model inference time
cache_hits_total                  Counter    Cache hits by model
cache_misses_total                Counter    Cache misses by model
predict_errors_total              Counter    Errors by model and type
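Metrics like these are typically declared once at module scope with prometheus_client and updated inside request handlers. The metric names below match the table; the label sets and helper function are assumptions, not the gateway's exact code.

```python
# Sketch of how the gateway's metrics could be declared with prometheus_client.
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter(
    "gateway_requests_total", "Total requests", ["endpoint", "status"])
LATENCY = Histogram(
    "gateway_request_latency_seconds", "End-to-end latency", ["endpoint"])
ACTIVE = Gauge("gateway_active_requests", "Current concurrent requests")
CACHE_HITS = Counter("cache_hits_total", "Cache hits", ["model"])


def record_request(endpoint: str, status: str, seconds: float) -> None:
    """Update the request counter and latency histogram for one call."""
    REQUESTS.labels(endpoint=endpoint, status=status).inc()
    LATENCY.labels(endpoint=endpoint).observe(seconds)
```

Exposing them is then just a matter of serving prometheus_client's generate_latest() output from the /metrics route.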

Key Concepts for Beginners

1. Model Serving with Triton

Triton Inference Server handles:

  • Model loading and versioning
  • Batch scheduling
  • Multiple framework support (PyTorch, TensorFlow, ONNX)
  • GPU/CPU optimization

2. Caching Strategy

We use content-based caching:

cache_key = hash(image_bytes) + complexity_level

Benefits:

  • Same image → instant response
  • Reduces GPU load
  • 1-hour TTL prevents stale results
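The idea above can be sketched in a few lines. The SHA-256 digest and key layout are assumptions about the exact scheme; the structure (hash the bytes, include the complexity level, set a one-hour TTL) follows the description.

```python
# Sketch of content-based caching in front of inference (assumed key scheme).
import hashlib
import json


def make_cache_key(image_bytes: bytes, complexity: str) -> str:
    # Same image + same complexity -> same key; any change busts the cache.
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"predict:{complexity}:{digest}"


def cached_predict(redis_client, image_bytes, complexity, run_inference,
                   ttl=3600):
    """Return (result, cache_hit). redis_client needs get() and setex()."""
    key = make_cache_key(image_bytes, complexity)
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit), True
    result = run_inference(image_bytes, complexity)
    redis_client.setex(key, ttl, json.dumps(result))  # 1-hour TTL
    return result, False
```

Including the complexity level in the key matters: the same image routed to ResNet18 and ResNet50 must not share a cache entry.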

3. Intelligent Routing

Route requests based on requirements:

model_map = {
    "simple": "resnet18",     # 18 layers, fast
    "medium": "efficientnet", # Efficient architecture
    "complex": "resnet50"     # 50 layers, accurate
}
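Wrapped in a small function, the lookup above also handles bad input. The fallback-to-"medium" behavior is an assumption for illustration; the gateway may instead reject unknown values.

```python
# Complexity-to-model routing; the default fallback is an assumed policy.
MODEL_MAP = {
    "simple": "resnet18",      # 18 layers, fast
    "medium": "efficientnet",  # efficient architecture
    "complex": "resnet50",     # 50 layers, accurate
}


def select_model(complexity: str) -> str:
    # Unknown values fall back to the balanced default.
    return MODEL_MAP.get(complexity, MODEL_MAP["medium"])
```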

4. Observability

The four pillars we implement:

  • Metrics: Prometheus counters, histograms, gauges
  • Visualization: Grafana dashboards
  • Health Checks: Liveness and readiness probes
  • Load Testing: Locust for performance validation

Troubleshooting

Triton won't start

# Check if models are exported
ls models/model_repository/resnet18/1/model.pt

# Check Triton logs
docker compose logs triton

No data in Grafana

# Verify gateway is running
curl http://localhost:8080/metrics

# Check Prometheus targets
open http://localhost:9090/targets

Connection refused errors

# Ensure all services are running
docker compose ps

# Gateway must run on host, not in Docker
python gateway/gateway.py

Contributing

Contributions are welcome! This project is designed to be beginner-friendly.

Ideas for improvement:

  • Add more model architectures
  • Implement A/B testing
  • Add authentication
  • Support batch inference
  • Add model warm-up

License

MIT License - feel free to use this for learning and production.


Built for learning. Ready for production.

If this project helped you understand ML inference systems, please give it a star!

About

The complete ML inference stack, from model to metrics. Suitable for beginners through seniors.
