# Day 30: AI-Powered Resilience Replay Generator

Reconstruct incident timelines, simulate expected system behavior with a per-metric linear regression, and generate a replay narrative with a local LLM (Llama 3 served by Ollama).
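
The three CSV files the notebook loads are not shown. Judging only by the columns the code reads, they might look roughly like the sketch below. The column names come from the notebook; every row value, the metric name `replicas`, and the helper `columns` are invented for illustration, and the real files may carry more columns.

```python
import csv
import io

# Hypothetical sample rows; only the column names are taken from the notebook code.
INCIDENT_CSV = """timestamp,service,expected_sla
2025-08-01 14:00:00,Service B,99.9
"""

BASELINE_CSV = """timestamp,service,metric,value
2025-08-01 13:57:00,Service B,replicas,6
2025-08-01 13:58:00,Service B,replicas,6
2025-08-01 13:59:00,Service B,replicas,6
"""

ACTUAL_CSV = """timestamp,service,metric,value
2025-08-01 14:00:00,Service B,replicas,2
"""


def columns(text):
    """Return the header row of a CSV string."""
    return next(csv.reader(io.StringIO(text)))


for name, text in [
    ("incident_metadata.csv", INCIDENT_CSV),
    ("historical_baseline.csv", BASELINE_CSV),
    ("actual_metrics.csv", ACTUAL_CSV),
]:
    print(name, columns(text))
```

Note that the code matches `actual` rows to an incident by exact timestamp equality, so an `actual` row only contributes if its timestamp is identical to the incident's.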

In [1]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import requests
from sklearn.linear_model import LinearRegression
from IPython.display import display, Markdown

# 📁 Load data
incident = pd.read_csv('incident_metadata.csv', parse_dates=['timestamp'])
baseline = pd.read_csv('historical_baseline.csv', parse_dates=['timestamp'])
actual = pd.read_csv('actual_metrics.csv', parse_dates=['timestamp'])

# 🧠 Simulate expected behavior using ML (simple regression per metric/service)
simulated = []
for _, row in incident.iterrows():
    service = row['service']
    ts = row['timestamp']
    expected_sla = row['expected_sla']  # read for reference; not used in the calculation below
    # Baseline history for this service (all timestamps)
    base_rows = baseline[baseline['service'] == service]
    # Observed metrics for this service at the exact incident timestamp
    actual_rows = actual[(actual['service'] == service) & (actual['timestamp'] == ts)]
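    for metric in base_rows['metric'].unique():
        # Order by time so the index-based regression follows the real sequence
        base_metric = base_rows[base_rows['metric'] == metric].sort_values('timestamp')
        if len(base_metric) < 2:
            continue  # a line needs at least two points
        # Fit value against sample index, then extrapolate one step ahead
        X = np.arange(len(base_metric)).reshape(-1, 1)
        y = base_metric['value'].values
        model = LinearRegression().fit(X, y)
        pred = model.predict(np.array([[len(base_metric)]]))[0]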
        actual_val = actual_rows[actual_rows['metric'] == metric]['value'].values
        actual_val = actual_val[0] if len(actual_val) > 0 else None
        deviation = None if actual_val is None else actual_val - pred
        simulated.append({
            'timestamp': ts,
            'service': service,
            'metric': metric,
            'expected': pred,
            'actual': actual_val,
            'deviation': deviation
        })

# 📊 Pinpoint deviation zones
# Note: the threshold of 1 is absolute and in each metric's own units (replicas, ms, ...)
deviation_zones = [s for s in simulated if s['deviation'] is not None and abs(s['deviation']) > 1]

# 🧠 Generate replay narrative using LLM (Ollama Llama 3)
def llm_replay_narrative(deviation_zones):
    prompt = (
        "You are an SRE assistant. Given the following incident deviation zones, generate a replay narrative that:\n"
        "- For each timestamp, describe what should have happened (expected) and what actually occurred (actual).\n"
        "- Highlight key deviations and their impact on related services.\n"
        "- Use concise, stakeholder-friendly language.\n"
        "- Example: 'At 14:03, Service B should have autoscaled to 6 pods. Instead, it stalled at 2, causing cascading latency in Service C.'\n\n"
    )
    for s in deviation_zones:
        prompt += (
            f"Timestamp: {s['timestamp']}, Service: {s['service']}, Metric: {s['metric']}, Expected: {s['expected']}, Actual: {s['actual']}, Deviation: {s['deviation']}\n"
        )
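    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,  # avoid hanging forever if the local Ollama server is unresponsive
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json().get("response", "No response from Llama 3.")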

# 📊 Display replay narrative
narrative = llm_replay_narrative(deviation_zones)
display(Markdown("## 🧠 Resilience Replay Narrative\n" + narrative))

## 🧠 Resilience Replay Narrative
**Replay Narrative**

On August 1st, 2025, at 14:00:00, our monitoring systems detected an incident affecting multiple services.

**14:00:00 - Service B**

Expected: Autoscale to 6 replicas to handle increased traffic.
Actual: Stalled at 2 replicas, resulting in a deviation of -4.0.

Impact: Service C started experiencing delays due to the bottleneck in Service B.

**14:03:00 - Service B (continued)**

Expected: Autoscale to 6 replicas to maintain performance.
Actual: Still stalled at 2 replicas, resulting in a deviation of -4.0 for the second consecutive minute.

Impact: Latency continued to increase in Service C, causing user-facing issues.

**14:05:00 - Service C**

Expected: Latency should remain within the threshold (120ms).
Actual: Latency surged to 300ms, with a deviation of 180.0.

Impact: Users experienced significant delays when interacting with Service C, potentially leading to frustration and abandonment.

In summary, an unexpected issue prevented Service B from autoscaling, causing a cascade effect on related services like Service C. We are investigating the root cause and implementing corrective actions to ensure prompt recovery and prevent future incidents.
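
The expected-versus-actual step in the code cell can be isolated so it is easy to test. The sketch below is a stand-in, not the notebook's own code: `expected_next` and `deviation` are hypothetical helper names, and it fits the least-squares line in plain Python instead of calling scikit-learn's `LinearRegression`, so it runs without dependencies. Like the cell, it uses the sample index as the only feature and extrapolates one step past the baseline. The numbers are illustrative.

```python
def expected_next(values):
    """Fit y = slope * x + intercept over x = 0..n-1 and predict at x = n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two baseline points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    # Closed-form ordinary least squares for a single feature
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * n + intercept


def deviation(baseline_values, actual_value):
    """Actual minus expected, matching the sign convention in the notebook."""
    return actual_value - expected_next(baseline_values)


# Illustrative: a flat baseline of 6 replicas, then an observed value of 2
print(deviation([6, 6, 6, 6], 2))  # -4.0
```

Because the fit is a straight line over the index, a baseline with a trend produces an expected value that continues the trend, which may or may not reflect how the system should behave during an incident.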