Autonomous Reliability Engineer (ARE) for Critical Power Systems
An end-to-end agentic platform that manages the full lifecycle of a critical equipment failure, from the first anomalous "shiver" in sensor data to a validated, dispatched repair work order, without a human digging through a single database.
In critical infrastructure (data centers, hospitals, manufacturing plants), a power failure can cost millions of dollars per minute. Today's monitoring tools are passive observers: they show a chart, fire an alert, and wait for a human. The human then:
- Searches through 200-page technical manuals
- Queries the maintenance history database
- Cross-checks safety protocols
- Finally writes a work order, 45 minutes later
Kinetic-Core collapses that 45 minutes to 45 seconds.
```mermaid
flowchart TB
    subgraph Edge["Edge / Field"]
        SENSORS["IoT Sensors\n(Generators · UPS · Transformers)"]
    end
    subgraph Ingestion["Ingestion Layer"]
        HUB["Azure IoT Hub"]
        EG["Event Grid"]
    end
    subgraph Intelligence["Intelligence Layer"]
        CA["Container Apps\nAgent Pipeline\n─────────────\nDiagnostic Agent\nLibrarian Agent\nPlanner Agent"]
        SEARCH["Azure AI Search\nManuals Index\n(Vector + BM25)"]
    end
    subgraph Data["Data Layer"]
        COSMOS[("Cosmos DB\nTelemetry · Incidents\nWork Orders · Memory")]
        KV["Key Vault\nSecrets"]
    end
    subgraph Presentation["Presentation"]
        SWA["Static Web Apps\nReact Dashboard"]
    end
    subgraph Observability["Observability"]
        LAW["Log Analytics"]
        APPI["App Insights"]
    end
    SENSORS -->|"MQTT / AMQP telemetry"| HUB
    HUB -->|"device events"| EG
    EG -->|"incident trigger"| CA
    CA <-->|"read / write"| COSMOS
    CA <-->|"semantic search"| SEARCH
    KV -.->|"secrets at runtime"| CA
    CA -->|"REST API"| SWA
    CA -.->|"logs + traces"| LAW
    CA -.->|"metrics"| APPI
```
| Stage | Azure Data Component | Azure AI Component |
|---|---|---|
| Ingestion | IoT Hub + Event Grid | Edge anomaly detection (streaming) |
| Processing | Data Factory + Cosmos DB | Semantic chunking of industrial PDFs |
| Knowledge | AI Search (Vector Store) | Hybrid RAG (keyword + vector embeddings) |
| Logic (Core) | Azure OpenAI GPT-4o | Multi-Agent Orchestration |
| Governance | AI Studio Evaluations | Hallucination detection + drift monitoring |
| CI/CD | GitHub Actions + Bicep | Prompt versioning lifecycle |
A gradual thermal escalation in a high-density server cooling rack: the kind a simple threshold would miss but an AI catches 4 hours early.
The "Hidden Fault": Coolant flow rate drops 8% over 6 hours due to a failing pump seal. Temperature rises non-linearly. Vibration signature shifts. A naive alert fires only when the thermal limit is breached. Kinetic-Core detects it at hour 2, identifies the fault code, finds the repair procedure in the manual, validates against safety protocol (voltage check), and dispatches a work order for the next maintenance window.
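The escalation above can be sketched with synthetic data: a slow linear decay in coolant flow that never crosses a naive fixed threshold during the 6-hour window, but that a rolling-trend check catches at hour 2. All numbers, names, and thresholds here are illustrative, not the actual generator in `data/synthetic/telemetry/`.

```python
# Illustrative sketch: a gradual coolant-flow decay (failing pump seal)
# caught early by a rolling-slope check, while a fixed threshold stays quiet.

def flow_series(hours: int, samples_per_hour: int = 60) -> list[float]:
    """Nominal 100 L/min flow that loses ~8% over 6 hours."""
    n = hours * samples_per_hour
    decay_per_sample = 0.08 * 100 / (6 * samples_per_hour)
    return [100.0 - decay_per_sample * i for i in range(n)]

def trend_alert(series: list[float], window: int = 120,
                slope_limit: float = -0.005):
    """Fire when the average per-sample slope over `window` samples is
    more negative than `slope_limit` (L/min per sample); return the
    sample index of the first alert, or None."""
    for end in range(window, len(series)):
        w = series[end - window:end]
        slope = (w[-1] - w[0]) / (window - 1)
        if slope < slope_limit:
            return end
    return None

series = flow_series(hours=6)
# Naive alarm: flow below 90 L/min. An 8% drop only reaches ~92, so it
# never fires in this window.
naive_alert = next((i for i, v in enumerate(series) if v < 90.0), None)
early_alert = trend_alert(series)  # fires at sample 120 = hour 2
```

The point of the sketch is the asymmetry: the trend detector fires a third of the way into the decay, while the threshold alarm would wait for the eventual thermal breach.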
**Diagnostic Lead**
- Role: Data Scientist / Anomaly Analyst
- Input: Live telemetry stream (temperature, voltage, vibration, coolant flow)
- Output: Fault classification, severity score, root cause hypothesis
- Model: GPT-4o with structured outputs + time-series context
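A minimal sketch of the structured output this agent could be constrained to emit. The field names are hypothetical; the production schemas live under `api/models/` as Pydantic models, and here a stdlib dataclass stands in for them.

```python
import json
from dataclasses import dataclass

# Hypothetical shape of the Diagnostic Lead's structured output; with
# Azure OpenAI structured outputs the model is constrained to this JSON
# shape, and here we just parse a canned response to show the round trip.
@dataclass
class Diagnosis:
    fault_code: str        # manufacturer fault code
    severity: float        # 0.0 (benign) .. 1.0 (critical)
    root_cause: str        # natural-language hypothesis
    evidence: list[str]    # telemetry channels supporting the call

    def validate(self) -> None:
        if not 0.0 <= self.severity <= 1.0:
            raise ValueError("severity must be in [0, 1]")

raw = ('{"fault_code": "KX-T2209-B", "severity": 0.72, '
       '"root_cause": "degrading pump seal reducing coolant flow", '
       '"evidence": ["coolant_flow", "vibration", "temperature"]}')
diagnosis = Diagnosis(**json.loads(raw))
diagnosis.validate()
```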
**Technical Librarian**
- Role: RAG Specialist
- Input: Fault code from Diagnostic Lead
- Output: Exact repair procedure, parts list, estimated duration
- Retrieval: Hybrid search (BM25 + Ada-002 embeddings) over PDF manuals
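Hybrid search services such as Azure AI Search fuse the keyword (BM25) and vector rankings with Reciprocal Rank Fusion (RRF). The sketch below shows that fusion step in isolation; document IDs and rankings are made up.

```python
# Sketch of the RRF fusion behind hybrid retrieval: each document's score
# is the sum over result lists of 1 / (k + rank). k=60 is the common default.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["manual-p212", "manual-p088", "manual-p019"]    # exact code match first
vector_hits = ["manual-p088", "manual-p305", "manual-p212"]  # semantic neighbours
fused = rrf([bm25_hits, vector_hits])
# "manual-p088" wins: strong in both lists beats a single #1 placement.
```

This is why the hybrid approach covers both failure modes: a page that only BM25 finds (an exact repair code) and a page that only the embedding finds (a paraphrased symptom) both survive into the fused ranking.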
**Safety Auditor**
- Role: Senior Safety Engineer (Adversarial)
- Input: Proposed repair procedure + live voltage/current readings
- Output: GO / NO-GO decision with safety justification
- Rule: Cannot approve hot-swap if voltage > 480V or arc flash risk detected
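The hard rule above can be kept deterministic, outside the LLM, so the model can tighten but never loosen it. A minimal sketch, with illustrative parameter names:

```python
# Deterministic safety gate enforcing the Auditor's rule: no hot-swap
# approval above 480 V or with arc-flash risk. The LLM's GO/NO-GO is
# AND-ed with this check, never allowed to override it.

def safety_gate(voltage_v: float, arc_flash_risk: bool,
                hot_swap: bool) -> tuple[str, str]:
    """Return (decision, justification)."""
    if hot_swap and voltage_v > 480.0:
        return "NO-GO", f"hot-swap blocked: {voltage_v:.0f} V exceeds 480 V limit"
    if hot_swap and arc_flash_risk:
        return "NO-GO", "hot-swap blocked: arc flash risk detected"
    return "GO", "within safety envelope"

decision, why = safety_gate(voltage_v=415.0, arc_flash_risk=False, hot_swap=True)
```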
**Planner / Orchestrator**
- Role: Operations Coordinator
- Input: Approved repair plan
- Output: Formatted work order, technician assignment, parts requisition
- Integration: Cosmos DB memory, Azure Communication Services
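The work order this agent emits might look like the following. Field names are hypothetical, not the production schema persisted to Cosmos DB:

```python
import json
from datetime import datetime, timezone

# Illustrative work-order document the Planner might write to Cosmos DB
# and hand to Azure Communication Services for technician dispatch.
def build_work_order(fault_code: str, procedure_id: str,
                     window_start: str) -> dict:
    return {
        "id": f"WO-{fault_code}",
        "fault_code": fault_code,
        "procedure_id": procedure_id,
        "scheduled_window": window_start,   # next maintenance window
        "status": "dispatched",
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

wo = build_work_order("KX-T2209-B", "proc-seal-replace-07",
                      "2025-01-12T02:00:00Z")
payload = json.dumps(wo)  # serialized body for the downstream integrations
```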
```text
kinetic-core/
├── agents/                    # Multi-agent system
│   ├── diagnostic_lead/       # Anomaly detection + root cause
│   ├── technical_librarian/   # RAG-powered repair lookup
│   ├── safety_auditor/        # Safety protocol enforcement
│   └── orchestrator/          # Multi-agent coordination
├── api/                       # FastAPI backend
│   ├── routers/               # Event, agent, workorder endpoints
│   ├── models/                # Pydantic schemas
│   └── middleware/            # Auth, logging, rate limiting
├── data/
│   ├── schemas/               # JSON schemas for telemetry & logs
│   └── synthetic/
│       ├── telemetry/         # IoT data generator (with hidden fault)
│       ├── logs/              # SQL maintenance history seeder
│       └── manuals/           # Technical manual content
├── docs/
│   ├── architecture/          # System design docs + Mermaid diagrams
│   ├── adr/                   # Architecture Decision Records
│   └── runbooks/              # Operational runbooks
├── infra/
│   ├── bicep/                 # Azure IaC (IoT Hub, AI Search, OpenAI...)
│   └── scripts/               # Deployment automation
├── ingestion/
│   ├── iot_simulator/         # Publishes synthetic telemetry to IoT Hub
│   └── event_processor/       # Azure Function: IoT Hub → Cosmos DB
├── knowledge/
│   ├── chunker/               # Semantic PDF chunking
│   ├── embedder/              # Azure OpenAI Ada-002 embedding pipeline
│   └── indexer/               # AI Search index management
├── monitoring/
│   ├── drift/                 # Model drift detection
│   └── evaluation/            # AI Studio evaluation harness
├── prompts/
│   ├── versions/              # Versioned prompt snapshots (v1.0, v1.1...)
│   └── templates/             # Jinja2 prompt templates
├── frontend/                  # React dashboard
├── notebooks/                 # EDA, fault analysis, RAG evaluation
├── tests/                     # Unit, integration, e2e
└── .github/workflows/         # CI/CD pipelines
```
- Azure CLI (`az login`)
- Python 3.11+
- Node.js 20+
- Docker (for local dev)
```bash
git clone https://github.com/vipul9811kumar/Kinetic-Core.git
cd Kinetic-Core
cp .env.template .env
# Fill in your Azure resource values
```

```bash
cd infra/bicep
az deployment group create \
  --resource-group kinetic-core-rg \
  --template-file main/main.bicep \
  --parameters @main/parameters.prod.json
```

```bash
pip install -e ".[dev]"
python data/synthetic/telemetry/generator.py   # seed synthetic data
python knowledge/indexer/indexer.py            # build AI Search index
uvicorn api.main:app --reload
```

```bash
python agents/orchestrator/orchestrator.py --scenario thermal_runaway
```

```bash
cd frontend
npm install && npm run dev
```

| Resource | SKU | Purpose |
|---|---|---|
| Azure IoT Hub | S1 | Telemetry ingestion from sensors |
| Azure Event Grid | Standard | Event routing IoT → Functions |
| Azure Data Factory | Standard | Batch orchestration of historical data |
| Azure Cosmos DB | Serverless | Agent memory + operational logs |
| Azure OpenAI | GPT-4o + Ada-002 | Reasoning engine + embeddings |
| Azure AI Search | Standard | Hybrid RAG vector store |
| Azure Functions | Consumption | Agent hosting (serverless) |
| Azure Container Registry | Basic | Agent container images |
| Azure AI Studio | Standard | Evaluation + drift monitoring |
| Azure Monitor | Standard | Observability + alerting |
| Azure Key Vault | Standard | Secret management |
- Adversarial Reasoning: Safety Auditor is explicitly designed to reject the Diagnostic Lead's recommendation if safety thresholds are violated. This prevents the classic "optimization at the expense of safety" failure mode.
- Hybrid RAG: Pure vector search misses exact repair codes (e.g., "KX-T2209-B"); pure BM25 misses semantic context. Hybrid combines both.
- Prompt Versioning: Every prompt change is committed to `prompts/versions/` and evaluated against a golden test set before deployment.
- Cosmos DB Agent Memory: Each agent writes its reasoning trace to Cosmos DB. The orchestrator can replay any incident for audit or retraining.
- Edge Filtering: A lightweight statistical model at the IoT Hub level (Azure Stream Analytics) pre-filters noise before GPT-4o is invoked, keeping inference costs under control.
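The edge-filtering idea can be illustrated with a rolling z-score gate: only statistically unusual readings justify an LLM call. The real deployment uses Azure Stream Analytics; the thresholds and function names below are illustrative.

```python
import statistics

# Sketch of the edge pre-filter: invoke GPT-4o only when a reading
# deviates more than z_limit standard deviations from the recent baseline.

def should_invoke_llm(history: list[float], reading: float,
                      z_limit: float = 3.0) -> bool:
    if len(history) < 30:        # not enough baseline yet: stay quiet
        return False
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard against zero spread
    return abs(reading - mean) / stdev > z_limit

baseline = [100.0 + 0.1 * (i % 5) for i in range(60)]  # stable flow readings
```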
Hallucination Guard: Every repair step recommended by the Librarian is scored against the source document using GPT-4o with citation validation. Steps with < 0.85 faithfulness score are flagged.
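The flagging decision itself is a simple threshold over per-step faithfulness scores. In production the scorer is the GPT-4o citation-validation call described above; here it is stubbed with canned scores, and the step texts are made up.

```python
# Sketch of the hallucination-guard decision: any Librarian repair step
# scoring below the 0.85 faithfulness floor is flagged for human review.

FAITHFULNESS_FLOOR = 0.85

def flag_unfaithful(steps: list[tuple[str, float]]) -> list[str]:
    """Return repair-step texts whose faithfulness score is below the floor."""
    return [text for text, score in steps if score < FAITHFULNESS_FLOOR]

scored = [
    ("Isolate pump P-3 and verify zero flow", 0.97),
    ("Replace seal kit KX-T2209-B", 0.91),
    ("Apply thread sealant to coolant line", 0.62),  # not grounded in the manual
]
flagged = flag_unfaithful(scored)
```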
Drift Monitor: Weekly batch job compares current anomaly detection accuracy against the baseline golden set. If F1 drops > 5%, a Slack alert fires and a retraining job is triggered.
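The weekly check reduces to comparing the current F1 against the stored baseline and alerting on a greater-than-5% relative drop. A minimal sketch, with illustrative numbers:

```python
# Sketch of the weekly drift check: trigger alerting/retraining when F1
# on the golden set drops more than 5% relative to the baseline.

def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

def drift_detected(baseline_f1: float, current_f1: float,
                   max_drop: float = 0.05) -> bool:
    return (baseline_f1 - current_f1) / baseline_f1 > max_drop

baseline = f1(precision=0.92, recall=0.90)  # golden-set score at deploy time
current = f1(precision=0.85, recall=0.80)   # this week's batch evaluation
needs_retraining = drift_detected(baseline, current)
```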
Prompt Registry: Every production prompt has a version ID, evaluation score, and deployment timestamp stored in Cosmos DB.
MIT
Built to demonstrate enterprise-grade autonomous AI for critical infrastructure. Every component maps to a real-world production pattern used by Fortune 500 operations teams.