Welcome to the Gemma-4 DevOps Agents workspace. This repository contains five specialized, self-hosted, AI-driven DevOps/SRE agents powered by Google's Gemma 4 model. The agents are packaged as Model Context Protocol (MCP) servers that analyze, monitor, and troubleshoot infrastructure components.
This workspace is organized into five distinct sub-agents, each tailored to a specific environment and serving stack:
| Sub-Agent | Purpose | Serving Engine | Target Infrastructure |
|---|---|---|---|
| Local DevOps Agent | CPU/GPU local analysis & prototyping | Ollama / vLLM | Local Docker / Workstations |
| GPU DevOps Agent (26B) | Serverless GPU-accelerated cloud analysis (26B config) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (6000) | Serverless GPU-accelerated cloud analysis (RTX 6000) | vLLM | Google Cloud Run (us-central1) |
| GPU DevOps Agent (vLLM) | Serverless GPU-accelerated cloud analysis (L4 GPU) | vLLM | Google Cloud Run (us-east4) |
| TPU DevOps Agent | Ultra-high performance enterprise log & infra analysis | vLLM | Google Cloud TPUs (v6e Trillium) |
- Automated SRE Diagnostics: Fetches and reviews system, container, and Cloud Logging entries using Gemma 4 to identify root causes and generate 3-step remediation plans.
- Serving Stack Control: Built-in tools to provision, start, stop, restart, and scale your vLLM and Ollama containers or Cloud TPU Queued Resources.
- Observability Dashboards: Real-time dashboards monitoring HBM usage, Tensor Core pressure, Prometheus metrics, and service latencies.
- Model Benchmarking: Tools to run load tests and vLLM's internal benchmark suites, returning performance metrics (TTFT, throughput, P95 latency).
- Gemini CLI Integration: Custom setup instructions using a LiteLLM Proxy to route standard Gemini CLI commands directly to your private, self-hosted Gemma 4 instance.
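The benchmarking metrics named above (TTFT, throughput, P95 latency) can be derived from per-request timings. The following is a minimal sketch, not the agents' actual benchmarking code: `RequestSample`, `percentile`, and `summarize` are hypothetical names, and the nearest-rank percentile method is an assumption, since benchmark suites may interpolate differently.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """Timings for one completed request (hypothetical structure)."""
    ttft_s: float        # time to first token, in seconds
    total_s: float       # end-to-end request latency, in seconds
    output_tokens: int   # number of tokens generated


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, wall_clock_s):
    """Aggregate a load-test run into the three headline metrics."""
    return {
        "mean_ttft_s": sum(s.ttft_s for s in samples) / len(samples),
        "p95_latency_s": percentile([s.total_s for s in samples], 95),
        # Aggregate throughput: all generated tokens over the run's wall-clock time.
        "throughput_tok_per_s": sum(s.output_tokens for s in samples) / wall_clock_s,
    }
```

Throughput is computed over wall-clock time rather than summed per-request durations, so concurrent requests count toward the aggregate rate.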
A root Makefile is provided to manage the sub-agents collectively:
- Help / Display commands: `make all`
- Install dependencies in all subdirectories: `make install`
- Run tests across all agents: `make test`
- Lint all Python directories: `make lint`
- Clean build/cache folders: `make clean`
- Role: SRE agent specialized in local containerized workloads.
- Inference Stack: Runs `gemma4:e2b` or `google/gemma-4-E2B-it` via local Docker (`ollama/ollama` or CPU/GPU vLLM).
- Key Tools:
  - `manage_docker`: Manage the local container.
  - `analyze_local_logs`: Automated log diagnostic reports.
  - `query_gemma4_with_stats`: Measure local inference latency and throughput.
  - `get_help`: Retrieve server configuration and tool details.
- Documentation: See local-devops-agent/README.md and local-devops-agent/GEMINI.md.
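As an illustration of how latency and throughput can be measured against a local Ollama container, the sketch below derives both from the timing fields that Ollama's `/api/generate` endpoint returns in its final response (durations are reported in nanoseconds). This is not the implementation of `query_gemma4_with_stats`; the function name `ollama_generation_stats` is hypothetical, and a vLLM backend would need a different approach because it does not return these fields.

```python
def ollama_generation_stats(response: dict) -> dict:
    """Derive latency and throughput from an Ollama /api/generate response body.

    Ollama reports all durations in nanoseconds.
    """
    ns_per_s = 1e9
    eval_s = response["eval_duration"] / ns_per_s
    return {
        "total_latency_s": response["total_duration"] / ns_per_s,
        "load_s": response.get("load_duration", 0) / ns_per_s,
        "prompt_tokens": response.get("prompt_eval_count", 0),
        "output_tokens": response["eval_count"],
        # Decode throughput: generated tokens divided by generation time only.
        "tokens_per_s": response["eval_count"] / eval_s if eval_s > 0 else 0.0,
    }
```

Dividing by `eval_duration` rather than `total_duration` keeps model load time and prompt processing out of the throughput figure, which makes cold and warm runs comparable.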
- Role: Cloud-based SRE managing GPU-accelerated serverless endpoints (26B configuration).
- Inference Stack: Runs `google/gemma-4-26B-A4B-it` via vLLM on GCP Cloud Run (RTX 6000 GPU in us-central1).
- Key Tools:
  - `deploy_vllm`: Automates serverless Cloud Run GPU vLLM deployments.
  - `analyze_cloud_logging`: Summarizes Google Cloud Logging errors.
  - `get_vllm_deployment_config`: Generates `gcloud` configuration options.
  - `get_help`: Retrieve server configuration and tool details.
- Documentation: See gpu-26B-devops-agent/README.md.
- Role: Cloud-based SRE managing GPU-accelerated serverless endpoints (RTX 6000 config).
- Inference Stack: Runs `google/gemma-4-26B-A4B-it` via vLLM on GCP Cloud Run (RTX 6000 GPU in us-central1).
- Key Tools:
  - `deploy_vllm`: Automates serverless Cloud Run GPU vLLM deployments.
  - `analyze_cloud_logging`: Summarizes Google Cloud Logging errors.
  - `get_vllm_deployment_config`: Generates `gcloud` configuration options.
  - `get_help`: Retrieve server configuration and tool details.
- Documentation: See gpu-6000-devops-agent/README.md.
- Role: Cloud-based SRE managing GPU-accelerated serverless endpoints (L4 configuration).
- Inference Stack: Runs `google/gemma-4-E4B-it` via vLLM on GCP Cloud Run (NVIDIA L4 GPU in us-east4).
- Key Tools:
  - `deploy_vllm`: Automates serverless Cloud Run GPU vLLM deployments.
  - `analyze_cloud_logging`: Summarizes Google Cloud Logging errors.
  - `get_vllm_deployment_config`: Generates `gcloud` configuration options.
  - `get_help`: Retrieve server configuration and tool details.
- Documentation: See gpu-vllm-devops-agent/README.md.
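To make the idea of generated `gcloud` configuration options concrete, here is a rough sketch of how a `gcloud run deploy` argument list for a GPU-backed vLLM service might be assembled. It is not the output of `get_vllm_deployment_config`: the function name, the service name, the image path, and the CPU/memory sizes are all placeholders, and passing `--model` through `--args` assumes the container entrypoint is vLLM's OpenAI-compatible server.

```python
import shlex


def build_cloud_run_deploy_args(service, image, region, gpu_type, model_id,
                                cpu=8, memory="32Gi", max_instances=1):
    """Assemble a `gcloud run deploy` command for a single-GPU service.

    All default sizes here are illustrative assumptions, not recommendations.
    """
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        "--gpu=1",
        f"--gpu-type={gpu_type}",
        f"--cpu={cpu}",
        f"--memory={memory}",
        "--no-cpu-throttling",              # keep CPU allocated for the whole instance lifetime
        f"--max-instances={max_instances}",
        f"--args=--model={model_id}",       # assumes a vLLM server entrypoint
    ]


if __name__ == "__main__":
    args = build_cloud_run_deploy_args(
        service="gemma-vllm",                                      # placeholder service name
        image="us-east4-docker.pkg.dev/PROJECT/REPO/vllm:latest",  # placeholder image path
        region="us-east4",
        gpu_type="nvidia-l4",
        model_id="google/gemma-4-E4B-it",
    )
    print(shlex.join(args))
```

Building the command as a list and only joining it for display avoids shell-quoting problems when the list is later passed to `subprocess.run`.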
- Role: High-performance TPU SRE/DevOps managing large-scale private clusters.
- Inference Stack: Runs `google/gemma-4-31B-it` via vLLM on Google Cloud TPUs (v6e Trillium / Flex-start VMs).
- Key Tools:
  - `manage_queued_resource`: Manage the TPU Queued Resource (create, check, etc.).
  - `run_vllm_benchmark`: Run performance benchmark on TPU.
  - `query_queued_gemma4_with_stats`: Query model on TPU and measure latency/throughput.
  - `get_help`: Retrieve server configuration and tool details.
- Documentation: See tpu-vllm-devops-agent/README.md and tpu-vllm-devops-agent/GEMINI.md.
When deploying to Google Cloud or Hugging Face, secure credentials using:
- Hugging Face Access Token: Saved locally or to Google Secret Manager via the `save_hf_token` tools.
- Application Default Credentials (ADC): Set up using GCP credentials helper scripts (`set_adc.sh` inside individual sub-agent folders).
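Before deploying, it can help to confirm that both kinds of credentials are discoverable. The sketch below is a hypothetical pre-flight check, not part of `save_hf_token` or `set_adc.sh`. It assumes the Hugging Face token is exposed through the `HF_TOKEN` environment variable and that ADC lives either at the path in `GOOGLE_APPLICATION_CREDENTIALS` or at the default gcloud location on Linux/macOS.

```python
import os
from pathlib import Path


def discover_credentials(env=None, home=None):
    """Report which credentials are visible, without reading their contents.

    `env` and `home` are injectable so the check can be exercised in tests.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Default file written by `gcloud auth application-default login` on Linux/macOS.
    adc_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return {
        "hf_token_set": bool(env.get("HF_TOKEN")),
        "adc_env_file": bool(explicit) and Path(explicit).is_file(),
        "adc_user_file": adc_file.is_file(),
    }
```

The check only tests for presence, so it never logs or returns secret values. A token stored in Google Secret Manager will not show up here unless it has been exported into the environment.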