Skip to content

vipul9811kumar/Kinetic-Core

Repository files navigation

Kinetic-Core

Autonomous Reliability Engineer (ARE) for Critical Power Systems

An end-to-end agentic platform that manages the full lifecycle of a critical equipment failure β€” from the first anomalous "shiver" in sensor data to a validated, dispatched repair work order β€” without a human digging through a single database.


The Problem

In critical infrastructure (data centers, hospitals, manufacturing plants), a power failure costs $millions per minute. Today's monitoring tools are passive observers β€” they show a chart, fire an alert, and wait for a human. The human then:

  1. Searches through 200-page technical manuals
  2. Queries the maintenance history database
  3. Cross-checks safety protocols
  4. Finally writes a work order β€” 45 minutes later

Kinetic-Core collapses that 45 minutes to 45 seconds.


Architecture Overview

flowchart TB
    subgraph Edge["🏭 Edge / Field"]
        SENSORS["IoT Sensors\n(Generators Β· UPS Β· Transformers)"]
    end

    subgraph Ingestion["⚑ Ingestion Layer"]
        HUB["Azure IoT Hub"]
        EG["Event Grid"]
    end

    subgraph Intelligence["🧠 Intelligence Layer"]
        CA["Container Apps\nAgent Pipeline\n─────────────\nDiagnostic Agent\nLibrarian Agent\nPlanner Agent"]
        SEARCH["Azure AI Search\nManuals Index\n(Vector + BM25)"]
    end

    subgraph Data["πŸ—„οΈ Data Layer"]
        COSMOS[("CosmosDB\nTelemetry Β· Incidents\nWork Orders Β· Memory")]
        KV["Key Vault\nSecrets"]
    end

    subgraph Presentation["πŸ–₯️ Presentation"]
        SWA["Static Web Apps\nReact Dashboard"]
    end

    subgraph Observability["πŸ“Š Observability"]
        LAW["Log Analytics"]
        APPI["App Insights"]
    end

    SENSORS  -->|"MQTT / AMQP telemetry"| HUB
    HUB      -->|"device events"| EG
    EG       -->|"incident trigger"| CA
    CA      <-->|"read / write"| COSMOS
    CA      <-->|"semantic search"| SEARCH
    KV       -.->|"secrets at runtime"| CA
    CA       -->|"REST API"| SWA
    CA       -.->|"logs + traces"| LAW
    CA       -.->|"metrics"| APPI
Loading

The "Architecture of Impact" β€” Full AI Spectrum

Stage Azure Data Component Azure AI Component
Ingestion IoT Hub + Event Grid Edge anomaly detection (streaming)
Processing Data Factory + Cosmos DB Semantic chunking of industrial PDFs
Knowledge AI Search (Vector Store) Hybrid RAG (keyword + vector embeddings)
Logic (Core) Azure OpenAI GPT-4o Multi-Agent Orchestration
Governance AI Studio Evaluations Hallucination detection + drift monitoring
CI/CD GitHub Actions + Bicep Prompt versioning lifecycle

Target Scenario: Thermal Runaway in Data Center Cooling Rack

A gradual thermal escalation in a high-density server cooling rack β€” the kind a simple threshold would miss but an AI catches 4 hours early.

The "Hidden Fault": Coolant flow rate drops 8% over 6 hours due to a failing pump seal. Temperature rises non-linearly. Vibration signature shifts. A naive alert fires only when the thermal limit is breached. Kinetic-Core detects it at hour 2, identifies the fault code, finds the repair procedure in the manual, validates against safety protocol (voltage check), and dispatches a work order for the next maintenance window.


Multi-Agent Personas

Agent 1: The Diagnostic Lead

  • Role: Data Scientist / Anomaly Analyst
  • Input: Live telemetry stream (temperature, voltage, vibration, coolant flow)
  • Output: Fault classification, severity score, root cause hypothesis
  • Model: GPT-4o with structured outputs + time-series context

Agent 2: The Technical Librarian

  • Role: RAG Specialist
  • Input: Fault code from Diagnostic Lead
  • Output: Exact repair procedure, parts list, estimated duration
  • Retrieval: Hybrid search (BM25 + Ada-002 embeddings) over PDF manuals

Agent 3: The Safety Auditor

  • Role: Senior Safety Engineer (Adversarial)
  • Input: Proposed repair procedure + live voltage/current readings
  • Output: GO / NO-GO decision with safety justification
  • Rule: Cannot approve hot-swap if voltage > 480V or arc flash risk detected

Agent 4: The Orchestrator

  • Role: Operations Coordinator
  • Input: Approved repair plan
  • Output: Formatted work order, technician assignment, parts requisition
  • Integration: Cosmos DB memory, Azure Communication Services

Repository Structure

kinetic-core/
β”œβ”€β”€ agents/                     # Multi-agent system
β”‚   β”œβ”€β”€ diagnostic_lead/        # Anomaly detection + root cause
β”‚   β”œβ”€β”€ technical_librarian/    # RAG-powered repair lookup
β”‚   β”œβ”€β”€ safety_auditor/         # Safety protocol enforcement
β”‚   └── orchestrator/           # Multi-agent coordination
β”œβ”€β”€ api/                        # FastAPI backend
β”‚   β”œβ”€β”€ routers/                # Event, agent, workorder endpoints
β”‚   β”œβ”€β”€ models/                 # Pydantic schemas
β”‚   └── middleware/             # Auth, logging, rate limiting
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ schemas/                # JSON schemas for telemetry & logs
β”‚   β”œβ”€β”€ synthetic/
β”‚   β”‚   β”œβ”€β”€ telemetry/          # IoT data generator (with hidden fault)
β”‚   β”‚   β”œβ”€β”€ logs/               # SQL maintenance history seeder
β”‚   β”‚   └── manuals/            # Technical manual content
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ architecture/           # System design docs + Mermaid diagrams
β”‚   β”œβ”€β”€ adr/                    # Architecture Decision Records
β”‚   └── runbooks/               # Operational runbooks
β”œβ”€β”€ infra/
β”‚   β”œβ”€β”€ bicep/                  # Azure IaC (IoT Hub, AI Search, OpenAI...)
β”‚   └── scripts/                # Deployment automation
β”œβ”€β”€ ingestion/
β”‚   β”œβ”€β”€ iot_simulator/          # Publishes synthetic telemetry to IoT Hub
β”‚   └── event_processor/        # Azure Function: IoT Hub β†’ Cosmos DB
β”œβ”€β”€ knowledge/
β”‚   β”œβ”€β”€ chunker/                # Semantic PDF chunking
β”‚   β”œβ”€β”€ embedder/               # Azure OpenAI Ada-002 embedding pipeline
β”‚   └── indexer/                # AI Search index management
β”œβ”€β”€ monitoring/
β”‚   β”œβ”€β”€ drift/                  # Model drift detection
β”‚   └── evaluation/             # AI Studio evaluation harness
β”œβ”€β”€ prompts/
β”‚   β”œβ”€β”€ versions/               # Versioned prompt snapshots (v1.0, v1.1...)
β”‚   └── templates/              # Jinja2 prompt templates
β”œβ”€β”€ frontend/                   # React dashboard
β”œβ”€β”€ notebooks/                  # EDA, fault analysis, RAG evaluation
β”œβ”€β”€ tests/                      # Unit, integration, e2e
└── .github/workflows/          # CI/CD pipelines

Quick Start

Prerequisites

  • Azure CLI (az login)
  • Python 3.11+
  • Node.js 20+
  • Docker (for local dev)

1. Clone & Configure

git clone https://github.com/vipul9811kumar/Kinetic-Core.git
cd Kinetic-Core
cp .env.template .env
# Fill in your Azure resource values

2. Deploy Azure Infrastructure

cd infra/bicep
az deployment group create \
  --resource-group kinetic-core-rg \
  --template-file main/main.bicep \
  --parameters @main/parameters.prod.json

3. Install & Run Backend

pip install -e ".[dev]"
python data/synthetic/telemetry/generator.py  # seed synthetic data
python knowledge/indexer/indexer.py           # build AI Search index
uvicorn api.main:app --reload

4. Run the Agents

python agents/orchestrator/orchestrator.py --scenario thermal_runaway

5. Start Frontend

cd frontend
npm install && npm run dev

Azure Resources Deployed

Resource SKU Purpose
Azure IoT Hub S1 Telemetry ingestion from sensors
Azure Event Grid Standard Event routing IoT β†’ Functions
Azure Data Factory Standard Batch orchestration of historical data
Azure Cosmos DB Serverless Agent memory + operational logs
Azure OpenAI GPT-4o + Ada-002 Reasoning engine + embeddings
Azure AI Search Standard Hybrid RAG vector store
Azure Functions Consumption Agent hosting (serverless)
Azure Container Registry Basic Agent container images
Azure AI Studio Standard Evaluation + drift monitoring
Azure Monitor Standard Observability + alerting
Azure Key Vault Standard Secret management

Key Design Decisions

  • Adversarial Reasoning: Safety Auditor is explicitly designed to reject the Diagnostic Lead's recommendation if safety thresholds are violated. This prevents the classic "optimization at the expense of safety" failure mode.
  • Hybrid RAG: Pure vector search misses exact repair codes (e.g., "KX-T2209-B"); pure BM25 misses semantic context. Hybrid combines both.
  • Prompt Versioning: Every prompt change is committed to prompts/versions/ and evaluated against a golden test set before deployment.
  • Cosmos DB Agent Memory: Each agent writes its reasoning trace to Cosmos DB. The orchestrator can replay any incident for audit or retraining.
  • Edge Filtering: A lightweight statistical model at the IoT Hub level (Azure Stream Analytics) pre-filters noise before GPT-4o is invoked β€” controlling cost.

Governance & Evaluation

Hallucination Guard: Every repair step recommended by the Librarian is scored against the source document using GPT-4o with citation validation. Steps with < 0.85 faithfulness score are flagged.

Drift Monitor: Weekly batch job compares current anomaly detection accuracy against the baseline golden set. If F1 drops > 5%, a Slack alert fires and a retraining job is triggered.

Prompt Registry: Every production prompt has a version ID, evaluation score, and deployment timestamp stored in Cosmos DB.


License

MIT


Built to demonstrate enterprise-grade autonomous AI for critical infrastructure. Every component maps to a real-world production pattern used by Fortune 500 operations teams.

About

A next-generation "Autonomous Reliability" platform designed for critical infrastructure. By bridging the gap between real-time industrial telemetry (IoT) and unstructured technical intelligence (RAG), Kinetic-Core utilizes a multi-agent orchestration framework to diagnose, prescribe, and initiate remediation for equipment failures.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors