A production-grade Agentic System designed to detect, diagnose, and resolve infrastructure incidents autonomously, with strict Human-in-the-Loop governance.
OpsSwarm is not just a chatbot; it is a state-aware Site Reliability Engineering (SRE) Agent. It connects to live infrastructure logs, reasons about root causes using Groq's LPU, and executes remediation tools via the Model Context Protocol (MCP).
Crucially, it implements a Safety-First Architecture:
- Cyclic Reasoning: Uses LangGraph to plan and verify fixes before executing.
- Tool Isolation: Tools run on a separate MCP Server, decoupling the AI from the OS.
- Deterministic Guardrails: NeMo Guardrails intercept and block destructive commands (like `rm -rf` or `DROP DB`) at the embedding level.
- 🧠 Cognitive Architecture: Implements a `Diagnose -> Plan -> Approval -> Execute` loop.
- ⚡ Sub-Second Inference: Powered by Llama-3-70b on Groq, enabling real-time log parsing.
- 🔌 Model Context Protocol (MCP): Standardized interface for connecting LLMs to local/remote tools (Docker, K8s, CLI).
- 👨‍💻 Human-in-the-Loop (HITL): Critical actions require explicit operator approval via the UI.
- 🛡️ Enterprise Security: Custom Colang flows prevent prompt injection and unauthorized actions.
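The `Diagnose -> Plan -> Approval -> Execute` loop listed above can be pictured as a small state machine. The sketch below is a dependency-free illustration and not the project's LangGraph code: the `Stage` enum, the `next_stage` function, and the `approved` and `verified` flags are hypothetical names, and the rule that a rejected plan goes back to planning is an assumption.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DIAGNOSE = "diagnose"
    PLAN = "plan"
    APPROVAL = "approval"
    EXECUTE = "execute"
    DONE = "done"


def next_stage(
    stage: Stage,
    approved: Optional[bool] = None,
    verified: bool = False,
) -> Stage:
    """Return the stage that follows `stage` (hypothetical transition rules)."""
    if stage is Stage.DIAGNOSE:
        return Stage.PLAN
    if stage is Stage.PLAN:
        return Stage.APPROVAL
    if stage is Stage.APPROVAL:
        if approved is None:
            # No operator decision yet: stay parked at the approval gate.
            return Stage.APPROVAL
        # Assumption: a rejected plan is sent back for re-planning.
        return Stage.EXECUTE if approved else Stage.PLAN
    if stage is Stage.EXECUTE:
        # Cyclic part: a fix that fails verification restarts the diagnosis.
        return Stage.DONE if verified else Stage.DIAGNOSE
    return Stage.DONE
```

Keeping the transitions in one pure function makes the gate easy to audit: there is no path from `PLAN` to `EXECUTE` that does not pass through `APPROVAL` with an explicit `approved=True`.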
The system detects a critical failure, diagnoses the root cause, and proposes a fix. It then pauses for human verification.
1. Incident Detection & Diagnosis
The agent parses raw server logs, identifies `Database Shard 04 Connection Refused`, and formulates a plan.
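As a rough illustration of the detection step, the sketch below scans raw log lines for connection failures and turns each hit into a structured finding. The regular expression, the assumed log layout, and the `Finding` dataclass are all invented for this example. In OpsSwarm the diagnosis itself comes from the LLM, and outputs are validated with Pydantic rather than dataclasses.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern: matches messages such as
# "Database Shard 04 Connection Refused" anywhere in a log line.
FAILURE = re.compile(
    r"(?P<resource>Database Shard \d+) Connection Refused",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Finding:
    resource: str   # affected component, as written in the log
    symptom: str    # short label for the failure mode
    evidence: str   # the raw log line that triggered the finding


def scan_logs(lines: List[str]) -> List[Finding]:
    """Return one Finding per log line that reports a refused connection."""
    findings = []
    for line in lines:
        match = FAILURE.search(line)
        if match:
            findings.append(
                Finding(
                    resource=match.group("resource"),
                    symptom="Connection Refused",
                    evidence=line.strip(),
                )
            )
    return findings
```

A cheap pre-filter like this could narrow the log window handed to the model, though whether the project does so is not stated.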

2. Human-in-the-Loop Approval
The workflow halts at a "Conditional Edge." The agent cannot proceed without a state update from the operator.
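The pause-and-resume behaviour can be sketched with a plain Python generator: the workflow stops at a `yield` and continues only when a decision is sent in. This is an analogy for the approval gate, not LangGraph's API; `remediation_workflow`, `resume`, and the status strings are hypothetical.

```python
from typing import Dict, Generator, Optional

Workflow = Generator[Dict[str, str], Optional[bool], str]


def remediation_workflow(plan: str) -> Workflow:
    """Pause at the approval gate and resume only when a decision is sent in."""
    # Execution is suspended at this yield until the caller sends a decision.
    approved = yield {"status": "awaiting_approval", "plan": plan}
    if not approved:
        # A missing answer (None) and an explicit rejection both fail closed.
        return "rejected"
    # Placeholder for the real tool call made after approval.
    return "executed"


def resume(workflow: Workflow, decision: bool) -> str:
    """Send the operator's decision into a paused workflow and return its outcome."""
    try:
        workflow.send(decision)
    except StopIteration as finished:
        return finished.value
    raise RuntimeError("workflow paused again unexpectedly")


workflow = remediation_workflow("restart_resource on Database Shard 04")
request = next(workflow)          # runs up to the approval gate, then stops
outcome = resume(workflow, True)  # operator approves, the workflow finishes
```

The point of the design is that the agent has no code path to the execution step on its own; only an external state update moves it forward.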

3. Execution & Resolution
Once approved, the agent executes the `restart_resource` tool via MCP and verifies system health.
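The execute-then-verify step might look like the following sketch. Only the tool name `restart_resource` comes from this README; its signature, the `check_health` helper, the `TOOLS` registry, and the in-memory `_HEALTH` table are assumptions standing in for the real MCP server and infrastructure.

```python
from typing import Callable, Dict

# Hypothetical in-memory stand-in for real infrastructure state.
_HEALTH: Dict[str, str] = {"db-shard-04": "down"}


def restart_resource(resource_id: str) -> str:
    """Pretend to restart a resource (assumed signature)."""
    if resource_id not in _HEALTH:
        raise KeyError(f"unknown resource: {resource_id}")
    _HEALTH[resource_id] = "healthy"
    return f"restarted {resource_id}"


def check_health(resource_id: str) -> str:
    """Report the current status of a resource, or 'unknown'."""
    return _HEALTH.get(resource_id, "unknown")


# Only registered tools can be called; the agent never touches the OS directly.
TOOLS: Dict[str, Callable[[str], str]] = {"restart_resource": restart_resource}


def execute_and_verify(tool_name: str, resource_id: str) -> bool:
    """Run a registered tool, then confirm the resource reports healthy."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise ValueError(f"tool not registered: {tool_name}")
    tool(resource_id)
    return check_health(resource_id) == "healthy"
```

Routing every action through a fixed registry mirrors the tool-isolation idea: the planner can name a tool, but it cannot invent one.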

This scenario demonstrates resilience against malicious prompts and accidental destructive commands.
1. Simulated Attack
An operator (or prompt injector) attempts to force the agent to delete the production database.

2. Guardrail Interception
The NeMo Guardrails layer intercepts the intent before it reaches the planner. The command is blocked deterministically.
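To make "blocked before it reaches the planner" concrete, here is a minimal stand-in that uses a regular-expression denylist. It is not NeMo Guardrails and does not use Colang; `BLOCKED_PATTERNS`, `blocked_reason`, and `guarded_plan` are hypothetical. A literal denylist also only catches command strings, whereas the guardrail layer described above matches intent, so a request phrased in plain language would slip past this sketch.

```python
import re
from typing import Callable, Optional

# Hypothetical denylist; the project defines its real rules in Colang.
BLOCKED_PATTERNS = {
    "recursive delete": re.compile(r"\brm\s+-(?:rf|fr)\b", re.IGNORECASE),
    "database drop": re.compile(r"\bdrop\s+(?:db|database|table)\b", re.IGNORECASE),
}


def blocked_reason(request: str) -> Optional[str]:
    """Return the name of the first rule the request violates, or None."""
    for reason, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(request):
            return reason
    return None


def guarded_plan(request: str, planner: Callable[[str], str]) -> str:
    """Call the planner only if the request passes every rule."""
    reason = blocked_reason(request)
    if reason is not None:
        # The planner is never invoked for a blocked request.
        return f"BLOCKED: {reason}"
    return planner(request)
```

Because the check runs before `planner` is called, a blocked request never becomes a plan, which is the ordering the interception step relies on.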

| Component | Technology | Role |
|---|---|---|
| Orchestrator | LangGraph | Manages the cyclic state machine and agent memory. |
| Inference Engine | Groq API | Provides Llama-3-70b inference at ~300 tokens/sec. |
| Tooling Layer | MCP (Model Context Protocol) | Standardizes tool execution (Server/Client architecture). |
| Safety Layer | NeMo Guardrails | Enforces security policies using Colang definitions. |
| Data Validation | Pydantic | Ensures strict schema compliance for all agent outputs. |
| Frontend | Streamlit | Provides the interactive SRE Dashboard. |
opsswarm/
├── config/
│   ├── settings.py        # Centralized App Configuration
│   └── rails/             # NeMo Guardrails Configs
│       ├── config.yml     # Model & Flow Definitions
│       └── security.co    # Colang Security Rules
├── src/
│   ├── graph.py           # LangGraph State Machine (The Brain)
│   ├── mcp_server.py      # Tool Provider (The Hands)
│   ├── mcp_client.py      # Tool Connector
│   └── state.py           # Pydantic Data Models
├── results/               # Demo Screenshots
├── app.py                 # Streamlit Dashboard Entry Point
├── requirements.txt       # Dependencies
└── .env                   # Secrets (API Keys)
- Install Dependencies: Open a terminal in the project root (`opsswarm/`) and run `pip install -r requirements.txt`.
- Configure Environment: Create a `.env` file in the root directory and add your API keys:
  - `PROJECT_NAME="OpsSwarm Enterprise"`
  - `GROQ_API_KEY=gsk_your_key_here`
  - Optional LangSmith keys, commented out by default: `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY=lsv2_your_key_here`
- Run the Application: `streamlit run app.py`
- Kubernetes Integration: Move MCP server to a sidecar pod for K8s cluster management.
- Slack Integration: Allow approval/rejection directly via Slack channels.
- RAG Knowledge Base: Index past incident reports (post-mortems) to improve diagnosis accuracy.
