A production-ready, high-performance machine learning inference platform built for learning and experimentation. This project demonstrates best practices for deploying ML models at scale with intelligent routing, caching, and comprehensive monitoring.
Perfect for beginners who want to understand how real-world ML inference systems work.
- Multi-Model Serving - Deploy ResNet18, ResNet50, and EfficientNet simultaneously via NVIDIA Triton Inference Server
- Intelligent Routing - Automatically select models based on complexity requirements (speed vs accuracy trade-off)
- Redis Caching - Avoid redundant inference calls with content-based caching
- Prometheus Metrics - Full observability with request rates, latencies, cache hit rates, and error tracking
- Grafana Dashboard - Pre-configured visualizations, ready out-of-the-box
- Load Testing - Built-in Locust scripts for performance benchmarking
```
             Client Request (image + complexity)
                              │
                              ▼
┌───────────────────────────────────────────────┐
│           FastAPI Gateway (:8080)             │
│   /predict   /health   /models   /metrics     │
│                                               │
│          Intelligent Model Router             │
│   simple ──▶ ResNet18        (fast)           │
│   medium ──▶ EfficientNet-B0 (balanced)       │
│   complex ─▶ ResNet50        (accurate)       │
└───────┬───────────────┬───────────────┬───────┘
        │               │               │
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis Cache  │ │    Triton    │ │  Prometheus  │
│   (:6379)    │ │ (:8000-8002) │ │   (:9090)    │
│  hash-based  │ │  TorchScript │ │   metrics &  │
│ cache, 1h TTL│ │    models    │ │  histograms  │
└──────────────┘ └──────────────┘ └──────┬───────┘
                                         │
                                         ▼
                                  ┌──────────────┐
                                  │   Grafana    │
                                  │   (:3000)    │
                                  │  9 dashboard │
                                  │    panels    │
                                  └──────────────┘
```
```
Serving-ResNet/
├── docker-compose.yml        # One-command infrastructure setup
├── requirements.txt          # Python dependencies
├── prometheus.yml            # Prometheus scrape configuration
├── cat.jpg                   # Sample test image
│
├── models/
│   ├── models.py             # Export PyTorch models to TorchScript
│   └── model_repository/     # Triton model repository
│       ├── resnet18/
│       │   ├── config.pbtxt  # Model configuration
│       │   └── 1/model.pt    # TorchScript model (generated)
│       ├── resnet50/
│       │   ├── config.pbtxt
│       │   └── 1/model.pt
│       └── efficientnet/
│           ├── config.pbtxt
│           └── 1/model.pt
│
├── gateway/
│   ├── gateway.py            # FastAPI inference gateway
│   └── load_test.py          # Locust load testing script
│
├── benchmark/
│   └── benchmark.py          # Direct Triton benchmarking tool
│
└── grafana/
    ├── provisioning/
    │   ├── datasources/      # Auto-configure Prometheus
    │   └── dashboards/       # Auto-load dashboards
    └── dashboards/
        └── ml-gateway.json   # Pre-built monitoring dashboard
```
- Python 3.9+
- Docker & Docker Compose
- ~10GB disk space (for Triton image)
```bash
git clone https://github.com/yourusername/Serving-ResNet.git
cd Serving-ResNet

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
cd models
python models.py
cd ..
```

This downloads pretrained weights and exports ResNet18, ResNet50, and EfficientNet-B0 to TorchScript format.
```bash
docker compose up -d
```

This launches:
| Service | Port | Description |
|---|---|---|
| Triton | 8000, 8001, 8002 | Model inference server |
| Redis | 6379 | Caching layer |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Visualization (admin/admin) |
```bash
python gateway/gateway.py
```

The gateway runs on http://localhost:8080.
```bash
# Single prediction
curl -X POST http://localhost:8080/predict \
  -F "file=@cat.jpg" \
  -F "complexity=simple"

# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/models

# View metrics
curl http://localhost:8080/metrics
```

`POST /predict` classifies an image using the selected model.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| file | File | required | Image file (JPEG, PNG) |
| complexity | string | "medium" | Model selection: simple, medium, complex |
| user_cache | bool | true | Enable/disable caching |
Model Selection:
| Complexity | Model | Speed | Accuracy |
|---|---|---|---|
| simple | ResNet18 | Fastest | Good |
| medium | EfficientNet-B0 | Balanced | Better |
| complex | ResNet50 | Slower | Best |
Response:
```json
{
  "model_used": "resnet18",
  "top_prediction": 285,
  "confidence": 0.892,
  "top5_classes": [285, 281, 282, 287, 286],
  "inference_latency_ms": 45.2,
  "total_latency_ms": 52.1,
  "cache_hit": false
}
```

`GET /health` checks service health.
```json
{
  "status": "healthy",
  "triton": "ok",
  "redis": "ok"
}
```

`GET /models` lists available models.
```json
{
  "models": ["resnet18", "resnet50", "efficientnet"],
  "count": 3
}
```

`GET /metrics` exposes the Prometheus metrics endpoint.
Test raw Triton inference performance:
```bash
cd benchmark
python benchmark.py
```

Sample output:

```
ResNet18:      P50: 12ms   P95: 18ms   Throughput: 85 req/s
ResNet50:      P50: 35ms   P95: 52ms   Throughput: 28 req/s
EfficientNet:  P50: 25ms   P95: 38ms   Throughput: 40 req/s
```
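P50 and P95 are latency percentiles (the value below which 50% or 95% of requests complete). As a minimal sketch of how such numbers can be computed from raw per-request latencies (nearest-rank method, pure Python; the actual benchmark.py may compute them differently):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(latencies_ms)
    # ceil(p/100 * n) gives the 1-based rank into the sorted samples
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[int(rank) - 1]

# Example: 100 simulated request latencies of 1..100 ms
samples = list(range(1, 101))
p50 = percentile(samples, 50)  # 50
p95 = percentile(samples, 95)  # 95
```

Throughput is then simply the number of completed requests divided by the wall-clock duration of the run.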
Test end-to-end performance with caching:
```bash
cd gateway
locust -f load_test.py --host=http://localhost:8080 --headless -u 50 -r 5 -t 60s
```

Parameters:
- `-u 50`: 50 concurrent users
- `-r 5`: spawn 5 users per second
- `-t 60s`: run for 60 seconds
Access Grafana at http://localhost:3000 (admin/admin).
The pre-configured dashboard includes:
| Panel | Description |
|---|---|
| Request Rate | Requests per second |
| Active Requests | Current in-flight requests |
| Cache Hit Rate % | Percentage of cached responses |
| Error Rate | Errors per second |
| Gateway Latency | P50/P95/P99 latency |
| Inference Latency by Model | Per-model Triton latency |
| Cache Hits vs Misses | Cache performance over time |
| Errors by Type | Error breakdown |
| Metric | Type | Description |
|---|---|---|
| `gateway_requests_total` | Counter | Total requests by endpoint and status |
| `gateway_request_latency_seconds` | Histogram | End-to-end latency |
| `gateway_active_requests` | Gauge | Current concurrent requests |
| `triton_inference_latency_seconds` | Histogram | Model inference time |
| `cache_hits_total` | Counter | Cache hits by model |
| `cache_misses_total` | Counter | Cache misses by model |
| `predict_errors_total` | Counter | Errors by model and type |
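As an illustration, a cache-hit-rate panel can be derived from the two cache counters with a PromQL expression along these lines (a sketch only; the dashboard's actual queries and label aggregations may differ):

```promql
# Percentage of /predict responses served from Redis over the last 5 minutes
100 * sum(rate(cache_hits_total[5m]))
    / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
```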
Triton Inference Server handles:
- Model loading and versioning
- Batch scheduling
- Multiple framework support (PyTorch, TensorFlow, ONNX)
- GPU/CPU optimization
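Each model in the repository is described by a `config.pbtxt`. A minimal configuration for a TorchScript ResNet might look like the following (illustrative values; the actual input/output names and dims depend on how the model was exported):

```protobuf
name: "resnet18"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```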
We use content-based caching:
```
cache_key = hash(image_bytes) + complexity_level
```
Benefits:
- Same image → instant response
- Reduces GPU load
- 1-hour TTL prevents stale results
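A minimal sketch of this scheme, using `hashlib` for the content hash and an in-process dict standing in for Redis (the real gateway stores results in Redis with a 1-hour TTL, e.g. via SETEX; the names below are illustrative):

```python
import hashlib
import time

TTL_SECONDS = 3600  # 1 hour, matching the gateway's cache policy
_cache = {}  # stand-in for Redis: key -> (expires_at, value)

def cache_key(image_bytes: bytes, complexity: str) -> str:
    """Content-based key: identical image + complexity -> identical key."""
    return f"{hashlib.sha256(image_bytes).hexdigest()}:{complexity}"

def get_cached(image_bytes: bytes, complexity: str):
    entry = _cache.get(cache_key(image_bytes, complexity))
    if entry and entry[0] > time.time():
        return entry[1]
    return None  # miss or expired

def put_cached(image_bytes: bytes, complexity: str, result: dict):
    _cache[cache_key(image_bytes, complexity)] = (time.time() + TTL_SECONDS, result)

# Same bytes + same complexity hits; a different complexity misses
put_cached(b"fake-jpeg-bytes", "simple", {"top_prediction": 285})
assert get_cached(b"fake-jpeg-bytes", "simple") == {"top_prediction": 285}
assert get_cached(b"fake-jpeg-bytes", "complex") is None
```

Hashing the raw bytes (rather than the filename) is what makes the cache content-based: re-uploading the same image under any name still hits.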
Route requests based on requirements:
```python
model_map = {
    "simple": "resnet18",      # 18 layers, fast
    "medium": "efficientnet",  # efficient architecture
    "complex": "resnet50",     # 50 layers, accurate
}
```

The four pillars we implement:
- Metrics: Prometheus counters, histograms, gauges
- Visualization: Grafana dashboards
- Health Checks: Liveness and readiness probes
- Load Testing: Locust for performance validation
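As a tiny illustration of the health-check pillar, a helper that aggregates dependency probes into a single payload in the same shape as the `/health` response above (the function name, probe arguments, and the "degraded" status value are hypothetical, not the gateway's actual code):

```python
def aggregate_health(triton_ok: bool, redis_ok: bool) -> dict:
    """Combine dependency probes into one health payload.

    The service reports 'healthy' only when every dependency responds;
    otherwise 'degraded', so orchestrators can stop routing traffic to it.
    """
    return {
        "status": "healthy" if (triton_ok and redis_ok) else "degraded",
        "triton": "ok" if triton_ok else "error",
        "redis": "ok" if redis_ok else "error",
    }

# Healthy when both probes pass
print(aggregate_health(True, True))
# → {'status': 'healthy', 'triton': 'ok', 'redis': 'ok'}
```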
```bash
# Check if models are exported
ls models/model_repository/resnet18/1/model.pt

# Check Triton logs
docker compose logs triton
```

```bash
# Verify gateway is running
curl http://localhost:8080/metrics

# Check Prometheus targets
open http://localhost:9090/targets
```

```bash
# Ensure all services are running
docker compose ps

# Gateway must run on host, not in Docker
python gateway/gateway.py
```

Contributions are welcome! This project is designed to be beginner-friendly.
Ideas for improvement:
- Add more model architectures
- Implement A/B testing
- Add authentication
- Support batch inference
- Add model warm-up
MIT License - feel free to use this for learning and production.
Built for learning. Ready for production.
If this project helped you understand ML inference systems, please give it a star!