Skip to content

thugscode/netpilot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ Netpilot - Self-Healing Kubernetes Agent System

Status: 75% Complete | Project Phase: 5/6 | Tests: 94/94 βœ… PASSING

Netpilot is an autonomous agent system that diagnoses and remediates failures in microservices running on Kubernetes. It uses LLM-guided diagnosis, policy-based validation, and automated remediation to maintain system SLAs.

🎯 Overview

Architecture

Kubernetes Cluster
    β”œβ”€β”€ Services (5 microservices with metrics)
    β”œβ”€β”€ Prometheus (metrics collection & alert rules)
    └── Alertmanager (alert routing & webhook)
         ↓
    TelemetryCollector (async KPI + log + alarm collection)
         ↓
    AgentPipeline (LLM diagnosis + action ranking)
         ↓
    PolicyGate (SLA validation, blast radius checking)
         ↓
    Executor (remediation actions via kubectl)
         ↓
    Evaluation Harness (MTTR, FPR, SLA metrics)

Key Capabilities

  • πŸ” Intelligent Diagnosis: LLM-powered root cause analysis of Kubernetes failures
  • ⚑ Fast Recovery: Automatic remediation with Mean Time To Recovery (MTTR) tracking
  • πŸ›‘οΈ Policy-Gated: Actions validated against SLA bounds and blast radius limits
  • πŸ“Š Comprehensive Evaluation: Metrics for action accuracy, SLA compliance, recovery time
  • πŸ” Safety First: Multiple validation layers before executing kubectl commands

πŸ“‹ Quick Start

1. Prerequisites

  • Python 3.13+
  • Kubernetes cluster (Kind or cloud-based)
  • OpenAI API key (or Anthropic)
  • Prometheus & Alertmanager running

2. Installation

# Clone repository
git clone https://github.com/yourusername/netpilot.git
cd netpilot

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-api-key"
export PROMETHEUS_URL="http://localhost:9090"
export ALERTMANAGER_URL="http://localhost:5000"

3. Run Simulation

# Terminal 1: Set up Kind cluster with monitoring
cd sim/cluster
bash monitoring/deploy.sh

# Terminal 2: Run Netpilot agent
cd netpilot
OPENAI_API_KEY=your-key python main.py

# Terminal 3: Inject failures
python sim/fault_injector.py --scenario pod-crash --target notification-service

# Watch agent respond in Terminal 2 with diagnosis and remediation

4. View Evaluation Results

# Generate evaluation report
python -m eval.report --detailed

# Output:
# ======================================================================
# NETPILOT EVALUATION REPORT
# ======================================================================
# 
# Metric                                   Value                         
# Mean Time To Recovery (MTTR)              45.5s
# False-Positive Rate                        0.0% (0/3)
# SLA Violation Rate                         0.0% (0/3)

πŸ“ Project Structure

netpilot/
β”œβ”€β”€ AGENTS.md                          ← Project status and architecture
β”œβ”€β”€ README.md                          ← This file
β”œβ”€β”€ requirements.txt                   ← Python dependencies
β”œβ”€β”€ config.py                          ← Central configuration
β”œβ”€β”€ main.py                            ← Entrypoint for continuous operation
β”‚
β”œβ”€β”€ sim/                               ← Simulation infrastructure
β”‚   β”œβ”€β”€ cluster/
β”‚   β”‚   β”œβ”€β”€ kind-config.yaml           ← Kind cluster configuration
β”‚   β”‚   β”œβ”€β”€ services/                  ← 5 microservices with Prometheus metrics
β”‚   β”‚   └── monitoring/                ← Prometheus + Alertmanager + Alert Receiver
β”‚   └── fault_injector.py              ← CLI tool for fault injection
β”‚
β”œβ”€β”€ telemetry/                         ← Telemetry collection & formatting
β”‚   β”œβ”€β”€ collector.py                   ← Main collector (KPIs, logs, alarms)
β”‚   β”œβ”€β”€ formatter.py                   ← Output formatting (JSON, Markdown, context-window)
β”‚   β”œβ”€β”€ schemas.py                     ← Pydantic models (LogEvent, KPI, Alarm, TelemetryBundle)
β”‚   └── test_*.py                      ← Unit tests
β”‚
β”œβ”€β”€ agent/                             ← LLM-based agent pipeline
β”‚   β”œβ”€β”€ pipeline.py                    ← Main agent loop (ingest β†’ diagnose β†’ rank)
β”‚   β”œβ”€β”€ prompts.py                     ← System prompt + few-shot examples
β”‚   β”œβ”€β”€ models.py                      ← Pydantic models (DiagnosisResult, RemediationAction)
β”‚   └── test_*.py                      ← Unit tests
β”‚
β”œβ”€β”€ policy/                            ← Policy validation gate
β”‚   β”œβ”€β”€ gate.py                        ← PolicyGate validation engine
β”‚   β”œβ”€β”€ invariants.py                  ← SLA bounds, rollback registry, blast radius
β”‚   β”œβ”€β”€ tests/test_gate.py             ← PolicyGate tests
β”‚   └── test_*.py                      ← Invariants tests
β”‚
β”œβ”€β”€ executor/                          ← Remediation action execution
β”‚   β”œβ”€β”€ remediation.py                 ← Maps actions to kubectl commands
β”‚   └── test_*.py                      ← Unit tests
β”‚
└── eval/                              ← Evaluation harness & metrics
    β”œβ”€β”€ harness.py                     ← Scenario runner
    β”œβ”€β”€ report.py                      ← Report generator
    β”œβ”€β”€ scenarios/                     ← YAML scenario definitions
    └── test_*.py                      ← Unit tests

πŸ”§ Configuration

Edit config.py or set environment variables:

# LLM Configuration
NETPILOT_LLM_PROVIDER=openai        # or "anthropic"
NETPILOT_LLM_MODEL=gpt-4            # or "gpt-4-turbo", "claude-3-opus"
OPENAI_API_KEY=your-api-key
ANTHROPIC_API_KEY=your-api-key

# Telemetry Configuration
PROMETHEUS_URL=http://localhost:9090
ALERTMANAGER_URL=http://localhost:5000
NETPILOT_COLLECTION_INTERVAL=30     # seconds

# Executor Configuration
KUBECONFIG=~/.kube/config
NETPILOT_EXECUTION_TIMEOUT=60       # seconds

# Logging
NETPILOT_LOG_DIR=logs
NETPILOT_LOG_LEVEL=INFO             # or DEBUG, WARNING, ERROR
NETPILOT_DEBUG=false

πŸ§ͺ Testing

Run All Tests

# Run full test suite (94 tests across all modules)
pytest -v

# Run specific module tests
pytest agent/ -v                    # Agent pipeline tests
pytest policy/ -v                   # Policy gate tests
pytest executor/ -v                 # Executor tests
pytest telemetry/ -v                # Telemetry tests
pytest eval/ -v                     # Evaluation harness tests

Run Evaluation Scenarios

# Run scenario suite (pod crash, link degradation, cascade)
python -m eval.harness

# Generate detailed report
python -m eval.report --detailed

# View individual scenario results
python -m eval.report --results-dir eval/results/

πŸ“Š Key Metrics

Mean Time To Recovery (MTTR)

  • Time from failure detection to SLA compliance recovery
  • Lower is better (target: < 60 seconds)
  • Typical range: 30-120 seconds

False-Positive Rate (FPR)

  • Percentage of remediation actions that were incorrect
  • Lower is better (target: 0%)
  • Formula: (wrong_actions / total_actions) Γ— 100%

SLA Violation Rate (SVR)

  • Percentage of scenarios where SLA was breached during recovery
  • Lower is better (target: 0%)
  • Formula: (scenarios_with_violations / total_scenarios) Γ— 100%

πŸš€ Usage

Running the Agent

# Start continuous monitoring and diagnosis
python main.py

# With custom configuration
NETPILOT_LOG_LEVEL=DEBUG PROMETHEUS_URL=http://custom:9090 python main.py

# With limited iterations (for testing)
python main.py --iterations 10

# Dry-run mode (diagnose only, don't execute)
NETPILOT_DRY_RUN=true python main.py

Running Scenarios

# Run single scenario
python -c "
import asyncio
from eval.harness import run_scenario

result = asyncio.run(run_scenario('01-notification-crash.yaml'))
print(f'MTTR: {result.mttr_seconds}s')
print(f'Success: {result.success}')
"

# Run full suite
python -c "
import asyncio
from eval.harness import run_scenario_suite, save_results

results, metrics = asyncio.run(run_scenario_suite([
    '01-notification-crash.yaml',
    '02-inventory-degrade.yaml',
    '03-order-cascade.yaml'
]))
save_results(results, metrics)
"

πŸ“š Documentation

πŸ” Examples

Example 1: Monitor a Single Service

from telemetry.collector import TelemetryCollector

# Collect metrics
async def monitor():
    collector = TelemetryCollector()
    bundle = await collector.collect()
    
    print(f"Services: {bundle.services_monitored}")
    for service, kpi in bundle.kpis.items():
        print(f"{service}: {kpi.error_rate:.1%} errors, {kpi.latency_p99_ms}ms p99")

import asyncio
asyncio.run(monitor())

Example 2: Diagnose and Remediate

from agent.pipeline import AgentPipeline
from policy.gate import PolicyGate
from executor.remediation import execute
from config import get_config

config = get_config()

async def diagnose_and_fix():
    # Get diagnosis
    agent = AgentPipeline(config.llm)
    diagnosis = await agent.diagnose(telemetry_context)
    
    if diagnosis:
        print(f"Root cause: {diagnosis.root_cause}")
        
        # Validate action
        gate = PolicyGate()
        for action in diagnosis.remediation_actions:
            allowed, reason = gate.validate(action, kpis)
            
            if allowed:
                # Execute
                result = execute(action)
                print(f"Action executed: {result.success}")
                break

Example 3: Generate Evaluation Report

# After running scenarios
python -m eval.report --detailed

# Programmatically
from eval.report import load_results, calculate_metrics, print_table

results = load_results()
metrics = calculate_metrics(results)
print_table(metrics)

# Access specific metrics
print(f"Average MTTR: {metrics['mean_mttr_seconds']:.1f}s")
print(f"False-Positive Rate: {metrics['false_positive_rate']:.1%}")
print(f"SLA Violation Rate: {metrics['sla_violation_rate']:.1%}")

πŸ› Troubleshooting

"Prometheus connection failed"

# Check Prometheus is running
kubectl get pod -n monitoring prometheus

# Port-forward if needed
kubectl port-forward -n monitoring svc/prometheus 9090:9090

"OPENAI_API_KEY not set"

export OPENAI_API_KEY="sk-..."
python main.py

"Policy validation failed"

Check SLA bounds are within cluster capabilities:

from policy.invariants import SLA_BOUNDS, print_sla_bounds
print_sla_bounds()

πŸ“ˆ Project Status

Phase Component Status Tests Coverage
1 Simulation (Kind, services, fault injector) βœ… Complete - -
2 Telemetry (collector, formatter, schemas) βœ… Complete - -
3 Policy Gate (validation, invariants) βœ… Complete 36/36 100%
4 Executor (remediation actions) βœ… Complete 18/18 100%
5 Evaluation (harness, report) βœ… Complete 26/26 100%
6 Configuration & Entrypoint πŸ”„ In Progress 1/2 50%

Overall: 75% Complete | Total Tests: 94/94 βœ… PASSING

🀝 Contributing

  1. Follow project structure in AGENTS.md
  2. Add unit tests for all new code
  3. Run pytest to verify all tests pass
  4. Update documentation with changes

πŸ“ License

MIT License - See LICENSE file for details

πŸ“ž Support

  • Issues: GitHub Issues
  • Documentation: See /docs and phase completion files
  • Examples: See examples/ directory

Last Updated: 2026-04-27 Project Status: 75% Complete (Phase 5 complete, Phase 6 in progress) Next Phase: Configuration integration and final testing

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors