A production-ready, high-performance machine learning inference platform built for learning and experimentation. This project demonstrates best practices for deploying ML models at scale with intelligent routing, caching, and comprehensive monitoring.
Perfect for beginners who want to understand how real-world ML inference systems work.
- Multi-Model Serving - Deploy ResNet18, ResNet50, and EfficientNet simultaneously via NVIDIA Triton Inference Server
- Intelligent Routing - Automatically select models based on complexity requirements (speed vs accuracy trade-off)
- Redis Caching - Avoid redundant inference calls with content-based caching
- Prometheus Metrics - Full observability with request rates, latencies, cache hit rates, and error tracking
- Grafana Dashboard - Pre-configured visualizations, ready out-of-the-box
- Load Testing - Built-in Locust scripts for performance benchmarking
```
             Client Request (image + complexity)
                              │
                              ▼
┌───────────────────────────────────────────────┐
│           FastAPI Gateway (:8080)             │
│   /predict   /health   /models   /metrics     │
│                                               │
│          Intelligent Model Router             │
│   simple ──▶ ResNet18        (fast)           │
│   medium ──▶ EfficientNet-B0 (balanced)       │
│   complex ─▶ ResNet50        (accurate)       │
└───────┬───────────────┬───────────────┬───────┘
        │               │               │
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis Cache  │ │    Triton    │ │  Prometheus  │
│   (:6379)    │ │ (:8000-8002) │ │   (:9090)    │
│  hash-based  │ │  TorchScript │ │   metrics &  │
│ cache, 1h TTL│ │    models    │ │  histograms  │
└──────────────┘ └──────────────┘ └──────┬───────┘
                                         │
                                         ▼
                                  ┌──────────────┐
                                  │   Grafana    │
                                  │   (:3000)    │
                                  │  9 dashboard │
                                  │    panels    │
                                  └──────────────┘
```
```
Serving-ResNet/
├── docker-compose.yml        # One-command infrastructure setup
├── requirements.txt          # Python dependencies
├── prometheus.yml            # Prometheus scrape configuration
├── cat.jpg                   # Sample test image
│
├── models/
│   ├── models.py             # Export PyTorch models to TorchScript
│   └── model_repository/     # Triton model repository
│       ├── resnet18/
│       │   ├── config.pbtxt  # Model configuration
│       │   └── 1/model.pt    # TorchScript model (generated)
│       ├── resnet50/
│       │   ├── config.pbtxt
│       │   └── 1/model.pt
│       └── efficientnet/
│           ├── config.pbtxt
│           └── 1/model.pt
│
├── gateway/
│   ├── gateway.py            # FastAPI inference gateway
│   └── load_test.py          # Locust load testing script
│
├── benchmark/
│   └── benchmark.py          # Direct Triton benchmarking tool
│
└── grafana/
    ├── provisioning/
    │   ├── datasources/      # Auto-configure Prometheus
    │   └── dashboards/       # Auto-load dashboards
    └── dashboards/
        └── ml-gateway.json   # Pre-built monitoring dashboard
```
- Python 3.9+
- Docker & Docker Compose
- ~10GB disk space (for Triton image)
```bash
git clone https://github.com/yourusername/Serving-ResNet.git
cd Serving-ResNet

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
cd models
python models.py
cd ..
```

This downloads pretrained weights and exports ResNet18, ResNet50, and EfficientNet-B0 to TorchScript format.
```bash
docker compose up -d
```

This launches:
| Service | Port | Description |
|---|---|---|
| Triton | 8000, 8001, 8002 | Model inference server |
| Redis | 6379 | Caching layer |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Visualization (admin/admin) |
```bash
python gateway/gateway.py
```

The gateway runs on http://localhost:8080.
```bash
# Single prediction
curl -X POST http://localhost:8080/predict \
  -F "file=@cat.jpg" \
  -F "complexity=simple"

# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/models

# View metrics
curl http://localhost:8080/metrics
```

`POST /predict` classifies an image using the selected model.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| file | File | required | Image file (JPEG, PNG) |
| complexity | string | "medium" | Model selection: simple, medium, complex |
| user_cache | bool | true | Enable/disable caching |
Model Selection:
| Complexity | Model | Speed | Accuracy |
|---|---|---|---|
| simple | ResNet18 | Fastest | Good |
| medium | EfficientNet-B0 | Balanced | Better |
| complex | ResNet50 | Slower | Best |
Response:
```json
{
  "model_used": "resnet18",
  "top_prediction": 285,
  "confidence": 0.892,
  "top5_classes": [285, 281, 282, 287, 286],
  "inference_latency_ms": 45.2,
  "total_latency_ms": 52.1,
  "cache_hit": false
}
```

`GET /health` checks service health.
```json
{
  "status": "healthy",
  "triton": "ok",
  "redis": "ok"
}
```

`GET /models` lists available models.
```json
{
  "models": ["resnet18", "resnet50", "efficientnet"],
  "count": 3
}
```

`GET /metrics` exposes the Prometheus metrics endpoint.
Test raw Triton inference performance:
```bash
cd benchmark
python benchmark.py
```

Sample output:

```
ResNet18:      P50: 12ms   P95: 18ms   Throughput: 85 req/s
ResNet50:      P50: 35ms   P95: 52ms   Throughput: 28 req/s
EfficientNet:  P50: 25ms   P95: 38ms   Throughput: 40 req/s
```
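P50 and P95 are latency percentiles (the value below which 50% or 95% of requests complete). As a minimal sketch of how such numbers can be computed from raw per-request latencies (nearest-rank method, pure Python; the actual benchmark.py may compute them differently):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(latencies_ms)
    # ceil(p/100 * n) gives the 1-based rank into the sorted samples
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[int(rank) - 1]

# Example: 100 simulated request latencies of 1..100 ms
samples = list(range(1, 101))
p50 = percentile(samples, 50)  # 50
p95 = percentile(samples, 95)  # 95
```

Throughput is then simply the number of completed requests divided by the wall-clock duration of the run.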
Test end-to-end performance with caching:
```bash
cd gateway
locust -f load_test.py --host=http://localhost:8080 --headless -u 50 -r 5 -t 60s
```

Parameters:
- `-u 50`: 50 concurrent users
- `-r 5`: spawn 5 users per second
- `-t 60s`: run for 60 seconds
Access Grafana at http://localhost:3000 (admin/admin).
The pre-configured dashboard includes:
| Panel | Description |
|---|---|
| Request Rate | Requests per second |
| Active Requests | Current in-flight requests |
| Cache Hit Rate % | Percentage of cached responses |
| Error Rate | Errors per second |
| Gateway Latency | P50/P95/P99 latency |
| Inference Latency by Model | Per-model Triton latency |
| Cache Hits vs Misses | Cache performance over time |
| Errors by Type | Error breakdown |
| Metric | Type | Description |
|---|---|---|
| `gateway_requests_total` | Counter | Total requests by endpoint and status |
| `gateway_request_latency_seconds` | Histogram | End-to-end latency |
| `gateway_active_requests` | Gauge | Current concurrent requests |
| `triton_inference_latency_seconds` | Histogram | Model inference time |
| `cache_hits_total` | Counter | Cache hits by model |
| `cache_misses_total` | Counter | Cache misses by model |
| `predict_errors_total` | Counter | Errors by model and type |
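As an illustration, a cache-hit-rate panel can be derived from the two cache counters with a PromQL expression along these lines (a sketch only; the dashboard's actual queries and label aggregations may differ):

```promql
# Percentage of /predict responses served from Redis over the last 5 minutes
100 * sum(rate(cache_hits_total[5m]))
    / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
```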
Triton Inference Server handles:
- Model loading and versioning
- Batch scheduling
- Multiple framework support (PyTorch, TensorFlow, ONNX)
- GPU/CPU optimization
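Each model in the repository is described by a `config.pbtxt`. A minimal configuration for a TorchScript ResNet might look like the following (illustrative values; the actual input/output names and dims depend on how the model was exported):

```protobuf
name: "resnet18"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```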
We use content-based caching:
```
cache_key = hash(image_bytes) + complexity_level
```
Benefits:
- Same image → instant response
- Reduces GPU load
- 1-hour TTL prevents stale results
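A minimal sketch of this scheme, using `hashlib` for the content hash and an in-process dict standing in for Redis (the real gateway stores results in Redis with a 1-hour TTL, e.g. via SETEX; the names below are illustrative):

```python
import hashlib
import time

TTL_SECONDS = 3600  # 1 hour, matching the gateway's cache policy
_cache = {}  # stand-in for Redis: key -> (expires_at, value)

def cache_key(image_bytes: bytes, complexity: str) -> str:
    """Content-based key: identical image + complexity -> identical key."""
    return f"{hashlib.sha256(image_bytes).hexdigest()}:{complexity}"

def get_cached(image_bytes: bytes, complexity: str):
    entry = _cache.get(cache_key(image_bytes, complexity))
    if entry and entry[0] > time.time():
        return entry[1]
    return None  # miss or expired

def put_cached(image_bytes: bytes, complexity: str, result: dict):
    _cache[cache_key(image_bytes, complexity)] = (time.time() + TTL_SECONDS, result)

# Same bytes + same complexity hits; a different complexity misses
put_cached(b"fake-jpeg-bytes", "simple", {"top_prediction": 285})
assert get_cached(b"fake-jpeg-bytes", "simple") == {"top_prediction": 285}
assert get_cached(b"fake-jpeg-bytes", "complex") is None
```

Hashing the raw bytes (rather than the filename) is what makes the cache content-based: re-uploading the same image under any name still hits.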
Route requests based on requirements:
```python
model_map = {
    "simple": "resnet18",      # 18 layers, fast
    "medium": "efficientnet",  # efficient architecture
    "complex": "resnet50",     # 50 layers, accurate
}
```

The four pillars we implement:
- Metrics: Prometheus counters, histograms, gauges
- Visualization: Grafana dashboards
- Health Checks: Liveness and readiness probes
- Load Testing: Locust for performance validation
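As a tiny illustration of the health-check pillar, a helper that aggregates dependency probes into a single payload in the same shape as the `/health` response above (the function name, probe arguments, and the "degraded" status value are hypothetical, not the gateway's actual code):

```python
def aggregate_health(triton_ok: bool, redis_ok: bool) -> dict:
    """Combine dependency probes into one health payload.

    The service reports 'healthy' only when every dependency responds;
    otherwise 'degraded', so orchestrators can stop routing traffic to it.
    """
    return {
        "status": "healthy" if (triton_ok and redis_ok) else "degraded",
        "triton": "ok" if triton_ok else "error",
        "redis": "ok" if redis_ok else "error",
    }

# Healthy when both probes pass
print(aggregate_health(True, True))
# → {'status': 'healthy', 'triton': 'ok', 'redis': 'ok'}
```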
```bash
# Check if models are exported
ls models/model_repository/resnet18/1/model.pt

# Check Triton logs
docker compose logs triton
```

```bash
# Verify gateway is running
curl http://localhost:8080/metrics

# Check Prometheus targets
open http://localhost:9090/targets
```

```bash
# Ensure all services are running
docker compose ps

# Gateway must run on host, not in Docker
python gateway/gateway.py
```

Contributions are welcome! This project is designed to be beginner-friendly.
Ideas for improvement:
- Add more model architectures
- Implement A/B testing
- Add authentication
- Support batch inference
- Add model warm-up
MIT License - feel free to use this for learning and production.
Built for learning. Ready for production.
If this project helped you understand ML inference systems, please give it a star!