Stars
Kubernetes-native Job Queueing
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
A tool to detect infrastructure issues on cloud native AI systems
A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Distribute and run LLMs with a single file.
A high-throughput and memory-efficient inference and serving engine for LLMs
Heterogeneous AI Computing Virtualization Middleware
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
Example DRA driver that developers can fork and modify to get them started writing their own.
batch-simulator is a Golang CLI tool that uses KWOK to simulate the lifecycle of Kubernetes API resources such as Nodes and Pods.
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Configuration data used to build OCP images
A unified tool for collecting system logs and other debug information
This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
JobSet: a k8s native API for distributed ML training and HPC workloads
A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
📕 Clarity in the current fast-paced mess of Open Source innovation
A multi-sandbox container runtime that provides cloud-native, all-scenario multiple sandbox container solutions.
CLI and validation tools for Kubelet Container Runtime Interface (CRI).
Core components in the OCM project. Report here if you find any issues in OCM.