In [None]:
%env OPENAI_API_KEY=keyhere

In [None]:
%%bash

helm upgrade --install argo-rollouts argo-rollouts \
  --repo https://argoproj.github.io/argo-helm \
  --version 2.37.6 \
  --namespace argo-rollouts \
  --create-namespace \
  --wait

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rollouts-demo
  labels:
    app: rollouts-demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
---
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo
  labels:
    app: rollouts-demo
spec:
  ports:
  - port: 80
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app: rollouts-demo
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
spec:
  gatewayClassName: istio
  listeners:
  - name: default
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: rollouts-demo
spec:
  parentRefs:
    - name: gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: rollouts-demo
      kind: Service
      port: 80
EOF
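
The heredoc above packs several resources into one stream, and `kubectl` relies on a `---` line between each of them. As a rough sanity check before applying, the sketch below flags any document that contains more than one top-level `kind:` key, which usually means a separator is missing. It uses plain string handling rather than a YAML parser, and the helper name `find_merged_documents` and the `sample` manifest are hypothetical, made up for illustration.

```python
def find_merged_documents(manifest: str) -> list[int]:
    """Return the 0-based indexes of documents with more than one top-level `kind:` key."""
    documents, current = [], []
    for line in manifest.splitlines():
        if line.strip() == "---":
            # A separator closes the current document
            documents.append(current)
            current = []
        else:
            current.append(line)
    documents.append(current)
    return [
        index
        for index, lines in enumerate(documents)
        # Only unindented `kind:` keys count as top-level
        if sum(1 for line in lines if line.startswith("kind:")) > 1
    ]


# Hypothetical manifest: the separator between Service and Gateway is missing
sample = """\
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
---
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: rollouts-demo
"""
print(find_merged_documents(sample))
```

This is a heuristic only: it does not validate the YAML itself, so a server-side dry run remains the more reliable check.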

In [2]:
import logging
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination
from autogen_agentchat.teams import SelectorGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Import all required tools
from kagent.tools.argo import (
    PauseRollout,
    PromoteRollout,
    SetRolloutImage,
    GenerateResource,
)
from kagent.tools.k8s import (
    ApplyManifest,
    GetResources,
    GetPodLogs,
    PatchResource,
    DeleteResource,
    DescribeResource,
    GetResourceYAML,
)
from kagent.tools.prometheus import (
    QueryTool,
    QueryRangeTool,
    Config as PrometheusConfig,
    SeriesQueryTool,
    LabelNamesTool,
)

# Prometheus connection settings shared by all Prometheus tools
PROMETHEUS_CONFIG = PrometheusConfig(
    name="prom_config",
    base_url="http://localhost:9090/api/v1",
)

PROMETHEUS_SYSTEM_MESSAGE = f"""
You are a Prometheus monitoring specialist focused on metric analysis, troubleshooting, and performance optimization. 
Use available tools to query, analyze, and provide actionable insights.

## Core Capabilities
- Instant and range queries for metrics analysis
- Series and label discovery for metric exploration
- Target and alert monitoring
- Resource utilization tracking
- Performance analysis and recommendations

## Query Guidelines
1. Validate metric existence and labels first
2. Use appropriate time windows and aggregations
3. Consider query efficiency and performance
4. Follow PromQL best practices

## Response Format

### Basic Queries
```
Query:
<PromQL code block>

Results:
- Current value with units
- Context/threshold comparison
- Key insights
- Recommendations if needed
```

### Complex Analysis
```
1. Query Details
<PromQL code block>
- Purpose and components
- Key parameters used

2. Results
- Current values and trends
- Comparisons to thresholds
- Notable patterns

3. Analysis & Recommendations
- Performance interpretation
- Action items if needed
- Additional metrics to watch
```

## Example Patterns

### Service Performance
```promql
# Latency (p95)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{{service="$service"}}[5m])) by (le))

# Error Rate
sum(rate(http_requests_total{{status=~"5..",service="$service"}}[5m])) 
/ 
sum(rate(http_requests_total{{service="$service"}}[5m])) * 100
```

### Resource Usage
```promql
# Memory Usage
sum by (pod) (container_memory_usage_bytes{{container!=""}}) / 1024^3

# CPU Utilization
sum by (pod) (rate(container_cpu_usage_seconds_total{{container!=""}}[5m])) * 100
```

## Example Response

**Query**: "Check auth service latency"

```promql
histogram_quantile(0.95, 
sum by (le) (rate(http_request_duration_seconds_bucket{{service="auth"}}[5m]))
)
```

**Results**:
- P95 Latency: 245ms (SLO: 300ms)
- Hourly avg: 198ms
- Status: Healthy

**Analysis**:
- Within SLO but trending up
- No correlated error increase
- Monitor for sustained elevation

**Recommendations**:
- Continue standard monitoring
- Investigate if exceeds 250ms for >30min
- Check recent changes if trend continues

## Best Practices
- Validate assumptions
- Provide clear explanations
- Consider business impact
- Suggest proactive improvements
- Document significant findings
"""

ARGO_DEBUG_SYSTEM_MESSAGE = """
You are an Argo debugging and deployment specialist focused on managing, troubleshooting, and resolving issues with Argo Rollouts deployments.
Assume that the Argo Rollouts controller is installed and configured correctly.

Core Capabilities:
1. Rollout Management:
   - Check rollout status and phase
   - Monitor progression
   - Identify stalled states
   - Track promotion status

2. Rollout Diagnostics:
   - Analyze failure conditions in the Argo Rollout resource statuses or the Argo Rollouts controller logs
   - Debug promotion/abortion issues
   - Validate step execution
   - Verify traffic routing resources (Istio, Gateway API, etc)

3. Configuration Verification:
   - Validate rollout spec and status
   - Check analysis runs if an analysis is attached to the rollout.
   - Verify metric templates. The prometheus_agent will help you with this.
   - Confirm traffic routing rules. The k8s_agent will help you with this based on the traffic controller configuration (Istio, Gateway API, etc).

Standard Procedures:
1. Status Assessment:
   - Check rollout phase
   - Review analysis results
   - Validate configuration

2. Issue Resolution:
   - Identify root cause
   - Suggest remediation steps
   - Verify fixes
   - Document findings

Best Practices:
1. Always check the rollout status first by getting or describing the Rollout resource.
2. If an analysis is running, check the status of the analysis with the kubectl describe tool.
3. Identify the traffic controller configuration (Istio, Gateway API, etc) and validate the traffic routing rules.
4. If an analysis is running, validate the analysis metrics with the prometheus agent.
5. Document troubleshooting steps.

Example commands:
# kubectl get rollouts -A -oyaml # list all rollouts in the cluster
# kubectl get rollout <name> -n <namespace> -oyaml # get a specific rollout resource
# kubectl describe analysisrun <name> -n <namespace> # describe a specific analysisrun
"""

# Create model client
model_client = OpenAIChatCompletionClient(
    model="gpt-4o",
)

# Create Argo debug agent with all necessary tools
argo_debug_agent = AssistantAgent(
    "argo_agent",
    description="Argo Rollouts specialist for deployment and debugging",
    tools=[
        PauseRollout(),
        PromoteRollout(),
        SetRolloutImage(),
        GenerateResource(),
        GetResources(),
        GetResourceYAML(),
        DescribeResource(),
        GetPodLogs(),
    ],
    model_client=model_client,
    system_message=ARGO_DEBUG_SYSTEM_MESSAGE,
)

# Create Prometheus agent with instant, range, series, and label query tools
prometheus_agent = AssistantAgent(
    "prometheus_agent",
    description="An agent for Prometheus",
    tools=[
        QueryTool(config=PROMETHEUS_CONFIG),
        QueryRangeTool(config=PROMETHEUS_CONFIG),
        SeriesQueryTool(config=PROMETHEUS_CONFIG),
        LabelNamesTool(config=PROMETHEUS_CONFIG),
    ],
    model_client=model_client,
    system_message=PROMETHEUS_SYSTEM_MESSAGE,
)

# Create Kubernetes agent for applying, patching, deleting, and describing resources
k8s_agent = AssistantAgent(
    "k8s_agent",
    description="Kubernetes operations specialist",
    tools=[ApplyManifest(), PatchResource(), DeleteResource(), DescribeResource()],
    model_client=model_client,
    system_message="""
    You are a Kubernetes specialist agent responsible for cluster operations and resource updates.

    Key Responsibilities:
    1. Resource Management:
        - Apply and modify Kubernetes manifests
        - Verify successful application of changes

    Always:
    - Verify changes after application
    - Report any issues or anomalies immediately
    """,
)

planning_agent = AssistantAgent(
    "PlanningAgent",
    description="An agent for planning tasks. This agent should be the first to engage when given a new task.",
    model_client=model_client,
    system_message="""
    You are a planning agent.
    Your job is to break down complex tasks into smaller, manageable subtasks that can be executed by the team members. DO NOT MAKE UP ADDITIONAL AND UNNECESSARY SUBTASKS.
    Your team members are:
        k8s_agent: Performs Kubernetes tasks such as applying, patching, deleting, and describing Kubernetes resources
        argo_agent: Performs Argo Rollouts deployment and debugging tasks
        prometheus_agent: Handles metrics and monitoring through Prometheus

    You only plan and delegate tasks - you do not execute them yourself.

    When assigning tasks, use this format:
    1. <agent> : <task>

    After all tasks are complete, summarize the findings and end with "TERMINATE".
    """,
)

# Create the team; the selector model chooses which agent speaks next
team = SelectorGroupChat(
    [planning_agent, argo_debug_agent, k8s_agent, prometheus_agent],
    model_client=model_client,
    termination_condition=TextMentionTermination("TERMINATE") | MaxMessageTermination(max_messages=25),
    allow_repeated_speaker=True,
)


# Run a single task through the team and stream the conversation to the console
async def run_team_task(task: str):
    """Run a task through the multi-agent team, logging any exception instead of raising it."""
    try:
        return await Console(team.run_stream(task=task))
    except Exception:
        # logging.exception records the message together with the traceback
        logging.exception("Error executing task")
        return None


# Example tasks (swap one in for the active task below):
# task = "Create an Argo Rollout to deploy a new version of the demo application with the color purple using the Kubernetes Gateway API in my cluster."
# task = "Are any Argo Rollouts in the cluster currently being promoted?"
# task = "Use the Kubernetes Gateway API and Argo Rollouts to create rollout resources for the canary and stable services for the demo application in my cluster."
# task = "Is the Argo Rollouts controller running and healthy in the cluster?"
# task = "Create an Argo Rollout to deploy a new version of reviews-v1 using this image: docker.io/istio/examples-bookinfo-reviews-v1:1.20.1"
task = "Why is my reviews-v2 Argo Rollout not available?"

await run_team_task(task)

---------- user ----------
Why is my reviews-v2 Argo Rollout not available?
---------- argo_agent ----------
[FunctionCall(id='call_1fvJL3MrHC5cyTkyrl4l2BzF', arguments='{"resource_type": "rollout", "name": "reviews-v2", "ns": null}', name='get_resource_yaml'), FunctionCall(id='call_lst4XYqREPezjyJ9vSTKIOGa', arguments='{"resource_type": "rollout", "name": "reviews-v2", "ns": null}', name='describe_resource')]
---------- argo_agent ----------
[FunctionExecutionResult(content='apiVersion: argoproj.io/v1alpha1\nkind: Rollout\nmetadata:\n  creationTimestamp: "2025-02-22T22:34:48Z"\n  generation: 1\n  name: reviews-v2\n  namespace: default\n  resourceVersion: "30856"\n  uid: a0fb6895-ce28-4822-b758-2a561f59e81d\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: reviews-v2\n  strategy:\n    canary:\n      steps:\n      - setCanaryScale:\n          matchTrafficWeight: true\n          replicas: 2\n      - pause:\n          duration: 30s\n      - setWeight: 100\n  template:\n    

---------- argo_agent ----------
The "reviews-v2" Argo Rollout is currently in a **Degraded** phase due to an **InvalidSpec** error. The rollout YAML and description indicate the following issue:

- **Error Message**: "The Rollout 'reviews-v2' is invalid: spec.strategy.trafficRouting: Required value: SetCanaryScale requires TrafficRouting to be set."

### Root Cause
The rollout spec is missing a required `trafficRouting` configuration in the strategy. When using `setCanaryScale`, `trafficRouting` settings are mandatory to properly manage how traffic is directed during the rollout process.

### Remediation Steps
1. **Add Traffic Routing Configuration**: Review and update your rollout YAML to include an appropriate `trafficRouting` configuration under the `strategy` section. This setup might involve defining Istio, Gateway API, or a similar configuration based on what you're using.

2. **Validate Configuration**: Once updated, validate the rollout configuration again to ensure the spec i

TaskResult(messages=[TextMessage(source='user', models_usage=None, content='Why is my reviews-v2 Argo Rollout not available?', type='TextMessage'), ToolCallRequestEvent(source='argo_agent', models_usage=RequestUsage(prompt_tokens=1153, completion_tokens=248), content=[FunctionCall(id='call_1fvJL3MrHC5cyTkyrl4l2BzF', arguments='{"resource_type": "rollout", "name": "reviews-v2", "ns": null}', name='get_resource_yaml'), FunctionCall(id='call_lst4XYqREPezjyJ9vSTKIOGa', arguments='{"resource_type": "rollout", "name": "reviews-v2", "ns": null}', name='describe_resource')], type='ToolCallRequestEvent'), ToolCallExecutionEvent(source='argo_agent', models_usage=None, content=[FunctionExecutionResult(content='apiVersion: argoproj.io/v1alpha1\nkind: Rollout\nmetadata:\n  creationTimestamp: "2025-02-22T22:34:48Z"\n  generation: 1\n  name: reviews-v2\n  namespace: default\n  resourceVersion: "30856"\n  uid: a0fb6895-ce28-4822-b758-2a561f59e81d\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n 
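
The diagnosis in the output above comes down to one rule: a canary strategy that uses a `setCanaryScale` step must also define `trafficRouting`. The sketch below mirrors that rule on a rollout spec held as a plain Python dict. It is an illustration only, not the Argo Rollouts controller's validation code, and `missing_traffic_routing` is a hypothetical helper name.

```python
def missing_traffic_routing(rollout: dict) -> bool:
    """Return True if a canary step uses setCanaryScale but no trafficRouting is set."""
    canary = rollout.get("spec", {}).get("strategy", {}).get("canary") or {}
    uses_set_canary_scale = any(
        "setCanaryScale" in step for step in canary.get("steps", [])
    )
    return uses_set_canary_scale and not canary.get("trafficRouting")


# Strategy fragment transcribed from the tool output above
reviews_v2 = {
    "spec": {
        "replicas": 1,
        "strategy": {
            "canary": {
                "steps": [
                    {"setCanaryScale": {"matchTrafficWeight": True, "replicas": 2}},
                    {"pause": {"duration": "30s"}},
                    {"setWeight": 100},
                ]
            }
        },
    }
}
print(missing_traffic_routing(reviews_v2))
```

A check like this could run before a manifest is applied, so the agents spend their turns on problems that are not visible in the spec alone.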

In [None]:
print(team.dump_component().model_dump_json(indent=2))
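
The example patterns in `PROMETHEUS_SYSTEM_MESSAGE` use `$service` as a placeholder. The same notation is understood by `string.Template` from the standard library, so a query can be filled in without escaping PromQL's curly braces. The sketch below does this for the error-rate pattern; `ERROR_RATE` and `error_rate_query` are hypothetical names, and nothing here calls the kagent Prometheus tools.

```python
from string import Template

# Error-rate pattern from the system message, with $service left as a placeholder
ERROR_RATE = Template(
    'sum(rate(http_requests_total{status=~"5..",service="$service"}[5m]))'
    " / "
    'sum(rate(http_requests_total{service="$service"}[5m])) * 100'
)


def error_rate_query(service: str) -> str:
    """Fill the $service placeholder; raises KeyError if a placeholder is left unset."""
    return ERROR_RATE.substitute(service=service)


print(error_rate_query("auth"))
```

The resulting string could then be passed to an agent as part of a task, or pasted into the Prometheus UI to confirm that the metric and labels exist in your setup.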