This repository provides a streamlined ML infrastructure environment using Minikube. It includes:
- Training the ResNet18 model on the MNIST dataset.
- Deploying the trained model for inference using Triton Inference Server.
- Kubernetes Helm charts to manage basic resources.
- Automation scripts to facilitate setup.
Prerequisites:

- Minikube installed on your local machine.
- Podman and CRI-O installed and configured (instead of Docker).
- Helm and Helmfile installed.
- `just` installed for task automation.
First, ensure Minikube is configured to use CRI-O as the container runtime; follow the setup instructions in the Justfile tasks.
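The `k8s-up` Justfile recipe presumably wraps a Minikube invocation along these lines (a sketch only; the driver and any extra flags are assumptions, and the actual recipe lives in the Justfile):

```just
k8s-up:
    minikube start \
        --driver=podman \
        --container-runtime=cri-o
```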
Repository structure:

- `helmfile.yaml`: Root Helmfile to manage multiple Helm charts.
- `bases/`: Base configuration files.
- `releases/`: Release configurations for components like NFS, Ingress, Seldon Core, and training jobs.
- `charts/`: Helm charts and Python scripts for the ResNet18 training job and Triton client.
- `hack/`: Helper scripts for setting up the environment, including NFS, DNS, and the Docker registry.
- `.env`, `.env.minikube`: Environment variable files.
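A root Helmfile following this layout might look roughly like the fragment below (release names, paths, and label values are illustrative assumptions, not the repository's actual contents):

```yaml
bases:
  - bases/environments.yaml   # hypothetical base file

releases:
  - name: nfs-server          # hypothetical release
    chart: charts/nfs-server
    labels:
      tier: ops
  - name: train-resnet18      # hypothetical release
    chart: charts/train
    labels:
      tier: train
```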
Setup steps:

- Install Podman and CRI-O:

  ```sh
  just podman
  just crio
  ```
- Start Minikube:

  ```sh
  just k8s-up
  just k8s-down  # tear down and retry if there's a problem
  ```
- Install the NFS server and client:

  ```sh
  just nfs
  ```
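Under the hood, sharing artifacts between the host and Minikube over NFS typically boils down to an `/etc/exports` entry such as the following (the export path and the Minikube subnet are placeholders):

```
/srv/nfs/share  192.168.49.0/24(rw,sync,no_subtree_check,no_root_squash)
```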
- Update `/etc/hosts` on the host and in Minikube, and patch the CoreDNS Corefile:

  ```sh
  just dns
  ```
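The usual result of this kind of step is a hosts entry plus a matching CoreDNS `hosts` block; the hostname and IP below are placeholders for illustration:

```
# /etc/hosts (on the host and the Minikube node)
192.168.49.1  registry.local

# CoreDNS Corefile fragment
hosts {
    192.168.49.1 registry.local
    fallthrough
}
```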
- Prepare the private registry on the host and configure CRI-O in Minikube:

  ```sh
  just registry
  ```
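For CRI-O inside Minikube to pull from a plain-HTTP private registry, it generally needs an insecure-registry entry in `/etc/containers/registries.conf`; a minimal sketch (registry name and port are placeholders):

```toml
[[registry]]
location = "registry.local:5000"
insecure = true
```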
- Build, test, and push the training job image:

  ```sh
  just docker-build
  just docker-run
  just docker-push
  ```
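The training image build presumably wraps a Dockerfile roughly like this sketch (the base image, script name, and paths are assumptions; the repository's actual Dockerfile may differ):

```dockerfile
# Assumed base image with PyTorch preinstalled
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY train.py .            # hypothetical training script name
CMD ["python", "train.py"]
```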
- Build, test, and push the Triton client image:

  ```sh
  just docker-build triton-client
  just docker-run triton-client
  just docker-push triton-client
  ```
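As a rough sketch of what such a client does before calling Triton, the snippet below (pure NumPy; the function name is hypothetical, and the normalization constants are the standard MNIST mean/std, which the actual client may or may not use) builds the NCHW float32 input a ResNet18 MNIST model would typically expect:

```python
import numpy as np

# Standard MNIST normalization constants (assumed to match the training job)
MNIST_MEAN, MNIST_STD = 0.1307, 0.3081

def preprocess(image: np.ndarray) -> np.ndarray:
    """Convert a (28, 28) uint8 grayscale image into a (1, 1, 28, 28)
    float32 batch in NCHW layout, normalized for inference."""
    x = image.astype(np.float32) / 255.0      # scale pixels to [0, 1]
    x = (x - MNIST_MEAN) / MNIST_STD          # normalize
    return x[np.newaxis, np.newaxis, :, :]    # add batch and channel dims

batch = preprocess(np.zeros((28, 28), dtype=np.uint8))
print(batch.shape, batch.dtype)
```

A real client would then wrap this array in a Triton inference request against the deployed model.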
- Select the environment:

  ```sh
  just env minikube
  ```
- Deploy all Helm charts at once:

  ```sh
  just apply
  ```
- (Optional) Deploy Helm charts by label:

  ```sh
  just apply tier=common  # Storage Class, Priority Class
  just apply tier=ops     # NFS, Ingress
  just apply tier=ml      # Seldon Core Operator, Triton Server
  just apply tier=train   # Training Job
  just apply tier=client  # Triton Client
  ```
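If you prefer invoking Helmfile directly, the label-scoped recipes above presumably map onto Helmfile's label selectors, e.g. (assuming the `just` recipes are thin wrappers; the environment name comes from the earlier `just env minikube` step):

```sh
helmfile -e minikube -l tier=ml apply   # illustrative equivalent of `just apply tier=ml`
```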
Planned future work:

- Automated CI/CD/CT Pipeline
- Drift Monitoring and Logging
- Scalability Improvements using Seldon
- Model Versioning and Management using DVC
- MetalLB Integration to handle external traffic
- Distributed Training with Ray or DistributedBackend