Status: 75% Complete | Project Phase: 5/6 | Tests: 94/94 ✅ PASSING
Netpilot is an autonomous agent system that diagnoses and remediates failures in microservices running on Kubernetes. It uses LLM-guided diagnosis, policy-based validation, and automated remediation to maintain system SLAs.
Kubernetes Cluster
├── Services (5 microservices with metrics)
├── Prometheus (metrics collection & alert rules)
└── Alertmanager (alert routing & webhook)
        ↓
TelemetryCollector (async KPI + log + alarm collection)
        ↓
AgentPipeline (LLM diagnosis + action ranking)
        ↓
PolicyGate (SLA validation, blast radius checking)
        ↓
Executor (remediation actions via kubectl)
        ↓
Evaluation Harness (MTTR, FPR, SLA metrics)
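For orientation, here is a minimal sketch of how these stages might be wired together in one iteration. It reuses the names from the API examples further down in this README (TelemetryCollector, AgentPipeline, PolicyGate, execute, get_config); main.py implements the real loop, so treat this purely as an illustration:

```python
import asyncio

from telemetry.collector import TelemetryCollector
from agent.pipeline import AgentPipeline
from policy.gate import PolicyGate
from executor.remediation import execute
from config import get_config

async def one_iteration():
    config = get_config()
    collector = TelemetryCollector()
    agent = AgentPipeline(config.llm)
    gate = PolicyGate()

    # Observe: pull KPIs, logs, and alarms from Prometheus/Alertmanager
    bundle = await collector.collect()

    # Diagnose: the LLM proposes a root cause and ranked remediation actions
    # (the real pipeline feeds a formatted telemetry context, see telemetry/formatter.py)
    diagnosis = await agent.diagnose(bundle)
    if not diagnosis:
        return

    # Validate + act: execute the first action that clears the policy gate
    for action in diagnosis.remediation_actions:
        allowed, reason = gate.validate(action, bundle.kpis)
        if allowed:
            execute(action)
            break

asyncio.run(one_iteration())
```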
- Intelligent Diagnosis: LLM-powered root cause analysis of Kubernetes failures
- Fast Recovery: Automatic remediation with Mean Time To Recovery (MTTR) tracking
- Policy-Gated: Actions validated against SLA bounds and blast radius limits
- Comprehensive Evaluation: Metrics for action accuracy, SLA compliance, recovery time
- Safety First: Multiple validation layers before executing kubectl commands
- Python 3.13+
- Kubernetes cluster (Kind or cloud-based)
- OpenAI API key (or Anthropic)
- Prometheus & Alertmanager running
# Clone repository
git clone https://github.com/yourusername/netpilot.git
cd netpilot
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export OPENAI_API_KEY="your-api-key"
export PROMETHEUS_URL="http://localhost:9090"
export ALERTMANAGER_URL="http://localhost:5000"

# Terminal 1: Set up Kind cluster with monitoring
cd sim/cluster
bash monitoring/deploy.sh
# Terminal 2: Run Netpilot agent
cd netpilot
OPENAI_API_KEY=your-key python main.py
# Terminal 3: Inject failures
python sim/fault_injector.py --scenario pod-crash --target notification-service
# Watch agent respond in Terminal 2 with diagnosis and remediation

# Generate evaluation report
python -m eval.report --detailed
# Output:
# ======================================================================
# NETPILOT EVALUATION REPORT
# ======================================================================
#
# Metric                              Value
# Mean Time To Recovery (MTTR)        45.5s
# False-Positive Rate                 0.0% (0/3)
# SLA Violation Rate                  0.0% (0/3)

netpilot/
├── AGENTS.md - Project status and architecture
├── README.md - This file
├── requirements.txt - Python dependencies
├── config.py - Central configuration
├── main.py - Entrypoint for continuous operation
│
├── sim/ - Simulation infrastructure
│   ├── cluster/
│   │   ├── kind-config.yaml - Kind cluster configuration
│   │   ├── services/ - 5 microservices with Prometheus metrics
│   │   └── monitoring/ - Prometheus + Alertmanager + Alert Receiver
│   └── fault_injector.py - CLI tool for fault injection
│
├── telemetry/ - Telemetry collection & formatting
│   ├── collector.py - Main collector (KPIs, logs, alarms)
│   ├── formatter.py - Output formatting (JSON, Markdown, context-window)
│   ├── schemas.py - Pydantic models (LogEvent, KPI, Alarm, TelemetryBundle)
│   └── test_*.py - Unit tests
│
├── agent/ - LLM-based agent pipeline
│   ├── pipeline.py - Main agent loop (ingest → diagnose → rank)
│   ├── prompts.py - System prompt + few-shot examples
│   ├── models.py - Pydantic models (DiagnosisResult, RemediationAction)
│   └── test_*.py - Unit tests
│
├── policy/ - Policy validation gate
│   ├── gate.py - PolicyGate validation engine
│   ├── invariants.py - SLA bounds, rollback registry, blast radius
│   ├── tests/test_gate.py - PolicyGate tests
│   └── test_*.py - Invariants tests
│
├── executor/ - Remediation action execution
│   ├── remediation.py - Maps actions to kubectl commands
│   └── test_*.py - Unit tests
│
└── eval/ - Evaluation harness & metrics
    ├── harness.py - Scenario runner
    ├── report.py - Report generator
    ├── scenarios/ - YAML scenario definitions
    └── test_*.py - Unit tests
Edit config.py or set environment variables:
# LLM Configuration
NETPILOT_LLM_PROVIDER=openai # or "anthropic"
NETPILOT_LLM_MODEL=gpt-4 # or "gpt-4-turbo", "claude-3-opus"
OPENAI_API_KEY=your-api-key
ANTHROPIC_API_KEY=your-api-key
# Telemetry Configuration
PROMETHEUS_URL=http://localhost:9090
ALERTMANAGER_URL=http://localhost:5000
NETPILOT_COLLECTION_INTERVAL=30 # seconds
# Executor Configuration
KUBECONFIG=~/.kube/config
NETPILOT_EXECUTION_TIMEOUT=60 # seconds
# Logging
NETPILOT_LOG_DIR=logs
NETPILOT_LOG_LEVEL=INFO # or DEBUG, WARNING, ERROR
NETPILOT_DEBUG=false
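As a quick sanity check that these variables were picked up, load the configuration the same way the API examples below do; get_config and config.llm are taken from those examples, other attribute names may differ:

```python
from config import get_config

config = get_config()
print(config.llm)   # should reflect NETPILOT_LLM_PROVIDER / NETPILOT_LLM_MODEL
```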
# Run full test suite (94 tests across all modules)
pytest -v
# Run specific module tests
pytest agent/ -v # Agent pipeline tests
pytest policy/ -v # Policy gate tests
pytest executor/ -v # Executor tests
pytest telemetry/ -v # Telemetry tests
pytest eval/ -v # Evaluation harness tests

# Run scenario suite (pod crash, link degradation, cascade)
python -m eval.harness
# Generate detailed report
python -m eval.report --detailed
# View individual scenario results
python -m eval.report --results-dir eval/results/

Mean Time To Recovery (MTTR)
- Time from failure detection to SLA-compliance recovery
- Lower is better (target: < 60 seconds)
- Typical range: 30-120 seconds

False-Positive Rate
- Percentage of remediation actions that were incorrect
- Lower is better (target: 0%)
- Formula (see the sketch below): (wrong_actions / total_actions) × 100%

SLA Violation Rate
- Percentage of scenarios where the SLA was breached during recovery
- Lower is better (target: 0%)
- Formula: (scenarios_with_violations / total_scenarios) × 100%
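To make the formulas concrete, here is a minimal sketch of how the three numbers could be computed from a list of scenario results. The fields action_correct and sla_violated, and the one-action-per-scenario simplification, are assumptions for illustration (mttr_seconds and the output keys match the names used elsewhere in this README; the real schema lives in eval/harness.py):

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    mttr_seconds: float     # time from fault detection to SLA-compliant recovery
    action_correct: bool    # hypothetical flag: was the executed remediation the right one?
    sla_violated: bool      # hypothetical flag: was the SLA breached during recovery?

def summarize(results: list[ScenarioResult]) -> dict:
    total = len(results)
    wrong_actions = sum(1 for r in results if not r.action_correct)
    violations = sum(1 for r in results if r.sla_violated)
    return {
        "mean_mttr_seconds": sum(r.mttr_seconds for r in results) / total,
        "false_positive_rate": wrong_actions / total,   # wrong_actions / total_actions (1 action per scenario)
        "sla_violation_rate": violations / total,       # scenarios_with_violations / total_scenarios
    }

# Three clean runs -> mean MTTR 45.5s, 0% false positives, 0% SLA violations
print(summarize([
    ScenarioResult(45.5, True, False),
    ScenarioResult(38.2, True, False),
    ScenarioResult(52.8, True, False),
]))
```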
# Start continuous monitoring and diagnosis
python main.py
# With custom configuration
NETPILOT_LOG_LEVEL=DEBUG PROMETHEUS_URL=http://custom:9090 python main.py
# With limited iterations (for testing)
python main.py --iterations 10
# Dry-run mode (diagnose only, don't execute)
NETPILOT_DRY_RUN=true python main.py

# Run single scenario
python -c "
import asyncio
from eval.harness import run_scenario
result = asyncio.run(run_scenario('01-notification-crash.yaml'))
print(f'MTTR: {result.mttr_seconds}s')
print(f'Success: {result.success}')
"
# Run full suite
python -c "
import asyncio
from eval.harness import run_scenario_suite, save_results
results, metrics = asyncio.run(run_scenario_suite([
'01-notification-crash.yaml',
'02-inventory-degrade.yaml',
'03-order-cascade.yaml'
]))
save_results(results, metrics)
"

- AGENTS.md - Comprehensive project status and architecture
- PHASE5_COMPLETION.md - Evaluation harness implementation details
- EXECUTOR_INTEGRATION.md - End-to-end remediation flow
- sim/cluster/DEPLOYMENT.md - Kubernetes cluster setup
- sim/FAULT_INJECTOR.md - Fault injection scenarios
- telemetry/README.md - Telemetry collection API
- agent/README.md - Agent pipeline documentation
- policy/GATE_GUIDE.md - Policy gate validation
- executor/README.md - Remediation action execution
- eval/REPORT.md - Evaluation and reporting
from telemetry.collector import TelemetryCollector
import asyncio

# Collect metrics
async def monitor():
    collector = TelemetryCollector()
    bundle = await collector.collect()
    print(f"Services: {bundle.services_monitored}")
    for service, kpi in bundle.kpis.items():
        print(f"{service}: {kpi.error_rate:.1%} errors, {kpi.latency_p99_ms}ms p99")

asyncio.run(monitor())

from agent.pipeline import AgentPipeline
from policy.gate import PolicyGate
from executor.remediation import execute
from config import get_config

config = get_config()

async def diagnose_and_fix():
    # Get diagnosis (telemetry_context is the formatted telemetry bundle and
    # kpis the per-service KPIs gathered by TelemetryCollector, as in the example above)
    agent = AgentPipeline(config.llm)
    diagnosis = await agent.diagnose(telemetry_context)

    if diagnosis:
        print(f"Root cause: {diagnosis.root_cause}")

        # Validate action
        gate = PolicyGate()
        for action in diagnosis.remediation_actions:
            allowed, reason = gate.validate(action, kpis)
            if allowed:
                # Execute
                result = execute(action)
                print(f"Action executed: {result.success}")
                break
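For context on what execute(action) does, here is a hypothetical sketch of an action-to-kubectl mapping. The real mapping lives in executor/remediation.py; the action type names and the helper to_kubectl_args below are illustrative assumptions only:

```python
import subprocess

# Hypothetical mapping from a remediation action to a kubectl invocation.
def to_kubectl_args(action_type: str, target: str, namespace: str = "default") -> list[str]:
    if action_type == "restart_pod":
        return ["kubectl", "rollout", "restart", f"deployment/{target}", "-n", namespace]
    if action_type == "scale_up":
        return ["kubectl", "scale", f"deployment/{target}", "--replicas=3", "-n", namespace]
    raise ValueError(f"unsupported action type: {action_type}")

# Requires a reachable cluster and a valid KUBECONFIG;
# the 60s timeout mirrors the NETPILOT_EXECUTION_TIMEOUT default above.
cmd = to_kubectl_args("restart_pod", "notification-service")
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
print(result.returncode, result.stdout or result.stderr)
```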
# After running scenarios
python -m eval.report --detailed
# Programmatically
from eval.report import load_results, calculate_metrics, print_table
results = load_results()
metrics = calculate_metrics(results)
print_table(metrics)
# Access specific metrics
print(f"Average MTTR: {metrics['mean_mttr_seconds']:.1f}s")
print(f"False-Positive Rate: {metrics['false_positive_rate']:.1%}")
print(f"SLA Violation Rate: {metrics['sla_violation_rate']:.1%}")

# Check Prometheus is running
kubectl get pod -n monitoring prometheus
# Port-forward if needed
kubectl port-forward -n monitoring svc/prometheus 9090:9090

# Make sure an LLM API key is set
export OPENAI_API_KEY="sk-..."
python main.py

Check SLA bounds are within cluster capabilities:
from policy.invariants import SLA_BOUNDS, print_sla_bounds
print_sla_bounds()
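If valid-looking actions keep getting rejected, it can help to compare the configured bounds against what the cluster can realistically deliver. The snippet below is a hypothetical illustration of such a check; it does not use the actual policy/invariants.py schema, and the bound names (max_error_rate, max_latency_p99_ms) are assumptions:

```python
# Hypothetical SLA-bound check: each service gets an upper bound on error rate and
# p99 latency, and observed KPIs are compared against those bounds.
SLA_BOUNDS_EXAMPLE = {
    "notification-service": {"max_error_rate": 0.05, "max_latency_p99_ms": 500.0},
}

def within_bounds(service: str, error_rate: float, latency_p99_ms: float) -> bool:
    bounds = SLA_BOUNDS_EXAMPLE.get(service)
    if bounds is None:
        return True  # no bound configured for this service
    return (error_rate <= bounds["max_error_rate"]
            and latency_p99_ms <= bounds["max_latency_p99_ms"])

print(within_bounds("notification-service", error_rate=0.02, latency_p99_ms=450.0))  # True
```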

| Phase | Component | Status | Tests | Coverage |
|---|---|---|---|---|
| 1 | Simulation (Kind, services, fault injector) | ✅ Complete | - | - |
| 2 | Telemetry (collector, formatter, schemas) | ✅ Complete | - | - |
| 3 | Policy Gate (validation, invariants) | ✅ Complete | 36/36 | 100% |
| 4 | Executor (remediation actions) | ✅ Complete | 18/18 | 100% |
| 5 | Evaluation (harness, report) | ✅ Complete | 26/26 | 100% |
| 6 | Configuration & Entrypoint | In Progress | 1/2 | 50% |

Overall: 75% Complete | Total Tests: 94/94 ✅ PASSING
- Follow project structure in AGENTS.md
- Add unit tests for all new code
- Run pytest to verify all tests pass
- Update documentation with changes
MIT License - See LICENSE file for details
- Issues: GitHub Issues
- Documentation: See /docs and phase completion files
- Examples: See examples/ directory
Last Updated: 2026-04-27 | Project Status: 75% Complete (Phase 5 complete, Phase 6 in progress) | Next Phase: Configuration integration and final testing