
🚀 Site Reliability Engineer (SRE) Agent 🕵️

Welcome to the SRE Agent project! This open-source AI agent is here to assist with your debugging, keep your systems healthy, and make your DevOps life a whole lot easier. Plug in your Kubernetes cluster, GitHub repo, and Slack, and let the agent do the heavy lifting: diagnosing, reporting, and keeping your team in the loop.

🌟 What is SRE Agent?

SRE Agent is your AI-powered teammate for monitoring application and infrastructure logs, diagnosing issues, and reporting diagnostics after errors. It connects directly into your stack, so you can focus on building, not firefighting.

SRE Agent in action

🤔 Why Did We Build This?

We wanted to learn best practices, costs, security, and performance tips for AI agents in production. Our journey is open source: check out our Production Journey Page and Agent Architecture Page for the full story.

We've been writing blogs and sharing our learnings along the way. Check out our blog for insights and updates.

Contributions welcome! Join us and help shape the future of AI-powered SRE.

✨ Features

  • 🕵️‍♂️ Root Cause Debugging – Finds the real reason behind app and system errors
  • 📜 Kubernetes Logs – Queries your cluster for logs and info
  • 🔍 GitHub Search – Digs through your codebase for bugs
  • 💬 Slack Integration – Notifies and updates your team
  • 🚦 Diagnose from Anywhere – Trigger diagnostics with a simple endpoint

Powered by the Model Context Protocol (MCP) for seamless LLM-to-tool connectivity.

🤖 Supported LLM Providers

The SRE Agent supports the following LLM providers:

Anthropic

  • Models: e.g. "claude-4-0-sonnet-latest"
  • Setup: Requires ANTHROPIC_API_KEY

Google Gemini

  • Models: e.g. "gemini-2.5-flash"
  • Setup: Requires GEMINI_API_KEY
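
Whichever provider you use, its API key ends up in the .env file that the setup script below creates. A minimal sketch of the relevant entries (illustrative only; the exact contents of your .env will vary, and other values such as Slack and GitHub credentials are filled in by the setup script):

# Provider credentials in .env (sketch)
ANTHROPIC_API_KEY=sk-ant-...   # when using Anthropic models
# GEMINI_API_KEY=...           # or this instead, when using Google Gemini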

🛠️ Prerequisites

  • Docker
  • A .env file in your project root (see below)
  • An app deployed on AWS EKS (Elastic Kubernetes Service) or GCP GKE (Google Kubernetes Engine)

⚡ Getting Started

Ready to see the agent in action? Our setup script will guide you through credential configuration, then you'll manually start the containers.

🚀 Credential Setup

Use our interactive setup script to configure your credentials:

python setup_credentials.py

The script will:

  • ✅ Auto-detect your platform (AWS/GCP) or let you choose
  • ✅ Guide you through credential setup with helpful prompts
  • ✅ Show current values and let you update them
  • ✅ Create your .env file automatically

🔧 Setup Options

Quick start with platform selection:

python setup_credentials.py --platform aws
# or
python setup_credentials.py --platform gcp

1️⃣ Connect to Your Kubernetes Cluster

For AWS EKS:

  1. Go to your AWS access portal and grab your access keys.

  2. Choose Option 2 and copy the credentials into ~/.aws/credentials:

    [default]
    aws_access_key_id=ABCDEFG12345
    aws_secret_access_key=abcdefg123456789
    aws_session_token=abcdefg123456789....=
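
If you have the AWS CLI installed, you can optionally sanity-check these credentials before starting the containers (YOUR_REGION is a placeholder for your cluster's region):

aws sts get-caller-identity                  # confirms the credentials are valid
aws eks list-clusters --region YOUR_REGION   # confirms you can reach EKS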

For GCP GKE:

Set up your GCP credentials using the gcloud CLI:

gcloud auth login
gcloud config set project YOUR_PROJECT_ID
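
As an optional check, confirm the account can see your GKE cluster (cluster name and region below are placeholders):

gcloud container clusters list
gcloud container clusters get-credentials YOUR_CLUSTER_NAME --region YOUR_REGION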

2️⃣ Start the Containers

After setting up your credentials, start the containers manually:

For AWS:

docker compose -f compose.aws.yaml up --build

For GCP:

docker compose -f compose.gcp.yaml up --build
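
Both compose files accept the standard Docker Compose flags; for example, to run the stack in the background, append --detach:

docker compose -f compose.aws.yaml up --build -d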

🚒 Deploy with ECR images

See ECR Setup for details.

docker compose -f compose.ecr.yaml up

Note: AWS credentials must be in your ~/.aws/credentials file.

You'll see logs like this when everything's running:

orchestrator-1   |    FastAPI   Starting production server 🚀
orchestrator-1   |
orchestrator-1   |              Searching for package file structure from directories with
orchestrator-1   |              __init__.py files
kubernetes-1     | ✅ Kubeconfig updated successfully.
kubernetes-1     | 🚀 Starting Node.js application...
orchestrator-1   |              Importing from /
orchestrator-1   |
orchestrator-1   |     module   📁 app
orchestrator-1   |              ├── 🐍 __init__.py
orchestrator-1   |              └── 🐍 client.py
orchestrator-1   |
orchestrator-1   |       code   Importing the FastAPI app object from the module with the following
orchestrator-1   |              code:
orchestrator-1   |
orchestrator-1   |              from app.client import app
orchestrator-1   |
orchestrator-1   |        app   Using import string: app.client:app
orchestrator-1   |
orchestrator-1   |     server   Server started at http://0.0.0.0:80
orchestrator-1   |     server   Documentation at http://0.0.0.0:80/docs
orchestrator-1   |
orchestrator-1   |              Logs:
orchestrator-1   |
orchestrator-1   |       INFO   Started server process [1]
orchestrator-1   |       INFO   Waiting for application startup.
orchestrator-1   |       INFO   Application startup complete.
orchestrator-1   |       INFO   Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
kubernetes-1     | 2025-04-24 12:53:00 [info]: Initialising Kubernetes manager {
kubernetes-1     |   "service": "kubernetes-server"
kubernetes-1     | }
kubernetes-1     | 2025-04-24 12:53:00 [info]: Kubernetes manager initialised successfully {
kubernetes-1     |   "service": "kubernetes-server"
kubernetes-1     | }
kubernetes-1     | 2025-04-24 12:53:00 [info]: Starting SSE server {
kubernetes-1     |   "service": "kubernetes-server"
kubernetes-1     | }
kubernetes-1     | 2025-04-24 12:53:00 [info]: mcp-kubernetes-server is listening on port 3001
kubernetes-1     | Use the following url to connect to the server:
kubernetes-1     | http://localhost:3001/sse {
kubernetes-1     |   "service": "kubernetes-server"
kubernetes-1     | }

This means all the services (Slack, GitHub, the orchestrator, the prompt server, and the MCP servers) have started successfully and are ready to handle requests.
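
If you want to poke at one of the MCP servers directly, the SSE endpoint shown in the log above can be inspected with curl (assuming port 3001 is published to your host, which depends on your compose file):

curl -N --max-time 5 http://localhost:3001/sse   # streams SSE events, gives up after 5 seconds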

🧑‍💻 Using the Agent

Trigger a diagnosis with a simple curl command:

curl -X POST http://localhost:8003/diagnose \
  -H "accept: application/json" \
  -H "Authorization: Bearer <token>" \
  -d "text=<service>"

  • Replace <token> with your dev bearer token (from .env)
  • Replace <service> with the name of your target Kubernetes service
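
For example, with a hypothetical service called payment-service and the token exported into a shell variable (the variable name is illustrative, not one defined by the project):

export DEV_BEARER_TOKEN="<token from your .env>"
curl -X POST http://localhost:8003/diagnose \
  -H "accept: application/json" \
  -H "Authorization: Bearer $DEV_BEARER_TOKEN" \
  -d "text=payment-service"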

The agent will do its thing and report back in your configured Slack channel 🎉

🩺 Checking Service Health

A /health endpoint is available on the orchestrator service:

curl -X GET http://localhost:8003/health

  • 200 OK = All systems go!
  • 503 Service Unavailable = Something's up; check the response for details.
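
If you script against the agent, a small wait loop can block until the orchestrator is ready; this is just a convenience sketch, not something shipped with the project:

# Poll /health every 5 seconds, give up after 60 seconds
timeout 60 bash -c 'until curl -fsS http://localhost:8003/health > /dev/null; do sleep 5; done' \
  && echo "orchestrator healthy"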

🚀 Deployments

Want to run this in the cloud? Check out our deployment examples:

📚 Documentation

Find all the docs you need in the docs folder:

🙏 Acknowledgements & Attribution

Big thanks to:

📖 Blogs

Check out our blog posts for insights and updates:
