# Log Analysis & Anomaly Summarization

Goal: summarize noisy logs into an **incident-ready report** with counts, suspected causes, and next actions.

What you’ll practice:
- Parsing logs with Python (baseline)
- Long-context prompting pattern (stats + samples)
- Structured incident report outputs
- Batch/day partitioning pattern


## 1. Setup and Installation

**Estimated time:** ~60–90 minutes (with exercises)

### Install
If needed, install dependencies:
```bash
pip install -U openai pydantic pandas numpy scikit-learn
```

### Environment
Set your API key:
```bash
export OPENAI_API_KEY="..."
```

> **Note:** All example data in this notebook is synthetic (safe to share in training).

In [None]:
import os

assert os.getenv('OPENAI_API_KEY'), "Set OPENAI_API_KEY in your environment"

## 2. Imports + API client

In [None]:
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from env

In [None]:
from pydantic import BaseModel, Field
from typing import List, Literal, Optional, Dict
import random
import pandas as pd
import re
from collections import Counter, defaultdict


## 3. Generate synthetic logs

We’ll simulate Nginx access logs plus app error logs for a single day. All timestamps fall within one hour (12:00 to 12:59).


In [None]:
random.seed(7)

def make_access_line(ts, status, path, rt_ms, ip="10.2.3.4", ua="Mozilla/5.0"):
    return f'{ip} - - [{ts}] "GET {path} HTTP/1.1" {status} 1234 "-" "{ua}" {rt_ms}ms'

def make_error_line(ts, level, msg, svc="api"):
    return f'[{ts}] {svc} {level}: {msg}'

paths = ["/", "/search", "/digital-collections/search", "/login", "/account/reset"]
statuses = [200,200,200,404,500,502]
error_msgs = [
    "upstream timeout while reading response header",
    "database connection pool exhausted",
    "cache miss storm detected",
    "token validation failed",
    "rate limit from upstream dependency"
]

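def gen_logs(n=800):
    """Return n access lines, interleaved with app error lines for most 5xx responses."""
    logs = []
    for i in range(n):
        # All timestamps fall in the 12:00 hour; minutes and seconds cycle with i.
        ts = f"09/Jan/2026:12:{i%60:02d}:{(i*7)%60:02d} -0500"
        path = random.choice(paths)
        # 200 appears three times in `statuses`, so it gets 210 of 230 weight units (~91%).
        status = random.choices(statuses, weights=[70,70,70,10,5,5])[0]
        # Latency is drawn from N(120, 80) ms and floored at 5 ms.
        rt_ms = int(max(5, random.gauss(120, 80)))
        logs.append(make_access_line(ts, status, path, rt_ms))
        # About 70% of 5xx responses also emit an app error line.
        if status in (500,502) and random.random() < 0.7:
            logs.append(make_error_line(ts, "ERROR", random.choice(error_msgs)))
    return logs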

logs = gen_logs()
print("Total lines:", len(logs))
print("\n".join(logs[:5]))

## 4. Parse logs (baseline)

We’ll extract:
- counts by status
- top endpoints for 5xx
- rough latency percentiles
- error message counts


In [None]:
# Access lines: capture path, status, and trailing response time (GET over HTTP/1.1 only).
ACCESS_RE = re.compile(r'\] "GET (?P<path>[^ ]+) HTTP/1\.1" (?P<status>\d{3}).* (?P<rt>\d+)ms$')
# App error lines: "[ts] <service> <LEVEL>: <message>".
ERROR_RE  = re.compile(r'\] (?P<svc>\w+) (?P<level>\w+): (?P<msg>.*)$')

access = []
errors = []
for line in logs:
    m = ACCESS_RE.search(line)
    if m:
        access.append({"path": m.group("path"), "status": int(m.group("status")), "rt_ms": int(m.group("rt"))})
        continue
    m2 = ERROR_RE.search(line)
    if m2:
        errors.append({"svc": m2.group("svc"), "level": m2.group("level"), "msg": m2.group("msg")})
    # Lines matching neither pattern are silently skipped.

df_access = pd.DataFrame(access)
df_errors = pd.DataFrame(errors)

df_access.head(), df_errors.head()
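
`ACCESS_RE` matches only `GET` requests over `HTTP/1.1`, which is enough for the synthetic data. Real access logs usually mix methods and protocol versions, so a more tolerant pattern may be useful. The sketch below is an assumption, not part of the notebook's pipeline: `ACCESS_RE_ANY`, `parse_access`, and the extra `method` group are hypothetical names.

```python
import re

# Hypothetical variant of ACCESS_RE: accepts any uppercase method and any HTTP version.
ACCESS_RE_ANY = re.compile(
    r'\] "(?P<method>[A-Z]+) (?P<path>[^ ]+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}).* (?P<rt>\d+)ms$'
)

def parse_access(line):
    """Return a dict of parsed fields, or None if the line is not an access line."""
    m = ACCESS_RE_ANY.search(line)
    if m is None:
        return None
    return {
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
        "rt_ms": int(m.group("rt")),
    }

# Illustrative line in the same layout as the synthetic access logs.
sample = '10.2.3.4 - - [09/Jan/2026:12:00:00 -0500] "POST /login HTTP/2.0" 502 1234 "-" "Mozilla/5.0" 87ms'
parsed = parse_access(sample)
print(parsed)
```

Returning `None` for non-matching lines lets the caller count and inspect unparsed lines instead of dropping them silently.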

In [None]:
status_counts = df_access["status"].value_counts().sort_index()
top_5xx_paths = df_access[df_access["status"].isin([500,502])]["path"].value_counts().head(10)
lat_pcts = df_access["rt_ms"].quantile([0.5,0.9,0.95,0.99]).to_dict()
err_counts = df_errors["msg"].value_counts().head(10)

status_counts, top_5xx_paths, lat_pcts, err_counts
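
The percentiles are rough partly because `Series.quantile` interpolates linearly between neighbouring values by default, so a reported p99 may not be a latency that was actually observed. A nearest-rank percentile always returns a real observation, which can be easier to trace back to a log line. A minimal standard-library sketch for comparison, where `nearest_rank` is a hypothetical helper and the latencies are made up:

```python
import math

def nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

latencies_ms = [12, 40, 95, 110, 120, 130, 180, 240, 310, 900]  # illustrative values
for pct in (50, 90, 99):
    print(f"p{pct}: {nearest_rank(latencies_ms, pct)} ms")
```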

## 5. Define a structured incident report schema

This is what you’d hand to an on-call engineer or attach to a ticket.


In [None]:
class IncidentReport(BaseModel):
    summary: str
    severity: Literal["SEV1","SEV2","SEV3","INFO"]
    suspected_root_causes: List[str]
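    # Caveat (assumption): strict structured outputs may reject open-ended dicts.
    # If the API returns a schema error here, model this as a list of name/value pairs.
    key_metrics: Dict[str, str]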
    top_errors: List[str]
    top_impacted_endpoints: List[str]
    recommended_actions: List[str]
    confidence: float = Field(..., ge=0, le=1)
    needs_human_review: bool
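
The schema enforces types and the 0 to 1 range on `confidence`, but it cannot express rules that span fields, such as "low confidence implies human review". A small post-check can enforce that deterministically instead of relying on the model to comply. This is a sketch under assumptions: `ReportFlags` is a hypothetical stand-in for the two relevant `IncidentReport` fields, and the 0.6 threshold mirrors the value in the system prompt used in the next section.

```python
from dataclasses import dataclass, replace

REVIEW_THRESHOLD = 0.6  # assumed cut-off, mirroring the system prompt

@dataclass(frozen=True)
class ReportFlags:
    """Hypothetical stand-in for the confidence-related fields of IncidentReport."""
    confidence: float
    needs_human_review: bool

def enforce_review_rule(flags: ReportFlags) -> ReportFlags:
    """Force needs_human_review=True whenever confidence is at or below the threshold."""
    if flags.confidence <= REVIEW_THRESHOLD and not flags.needs_human_review:
        return replace(flags, needs_human_review=True)
    return flags

print(enforce_review_rule(ReportFlags(confidence=0.4, needs_human_review=False)))
print(enforce_review_rule(ReportFlags(confidence=0.9, needs_human_review=False)))
```

With the real model, and assuming pydantic v2, the analogous call would be `report.model_copy(update={"needs_human_review": True})`.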

## 6. Long-context prompt (stats + samples)

We give the model:
1) aggregate stats (cheap, informative)
2) a small sample of raw lines (so the model sees the actual line format and message wording)
3) clear constraints + schema
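
Taking the first N lines is the simplest way to pick the raw sample, but when errors are rare the sample can end up almost entirely successful requests. One alternative is to reserve part of the sample budget for error lines. The sketch below is an assumption rather than the notebook's method: `pick_samples`, `is_error_line`, the regex heuristic, and the half-and-half split are all hypothetical choices.

```python
import random
import re

# Assumed heuristic: a 5xx status right after the quoted request, or an ERROR-level app line.
ERROR_HINT = re.compile(r'" 5\d\d |\] \w+ ERROR: ')

def is_error_line(line):
    return ERROR_HINT.search(line) is not None

def pick_samples(lines, k=40, error_share=0.5, seed=0):
    """Pick up to k lines, reserving about error_share of the budget for error lines."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    err_idx = [i for i, line in enumerate(lines) if is_error_line(line)]
    ok_idx = [i for i, line in enumerate(lines) if not is_error_line(line)]
    n_err = min(len(err_idx), int(k * error_share))
    n_ok = min(len(ok_idx), k - n_err)  # unused error budget goes to normal lines
    chosen = rng.sample(err_idx, n_err) + rng.sample(ok_idx, n_ok)
    return [lines[i] for i in sorted(chosen)]  # keep original (chronological) order

# Illustrative input: 95 successful requests, 3 failed requests, 2 app errors.
demo = (
    ['10.2.3.4 - - [09/Jan/2026:12:00:00 -0500] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0" 100ms'] * 95
    + ['10.2.3.4 - - [09/Jan/2026:12:01:00 -0500] "GET /search HTTP/1.1" 502 1234 "-" "Mozilla/5.0" 300ms'] * 3
    + ['[09/Jan/2026:12:01:00 -0500] api ERROR: upstream timeout while reading response header'] * 2
)
sample = pick_samples(demo, k=10)
print(sum(is_error_line(line) for line in sample), "of", len(sample), "sampled lines are errors")
```

A fixed `seed` keeps the sample reproducible between runs, which makes it easier to compare reports.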


In [None]:
SYSTEM = """You are an SRE assistant. Produce an incident report from log stats + samples.
Be conservative: if evidence is weak, set needs_human_review=true and confidence<=0.6.
Avoid hallucinating: only infer causes supported by the provided errors/metrics."""

def build_input(status_counts, top_5xx_paths, lat_pcts, err_counts, sample_lines):
    return f"""Log stats:
Status counts: {status_counts.to_dict()}
Top 5xx endpoints: {top_5xx_paths.to_dict()}
Latency percentiles (ms): {lat_pcts}
Top error messages: {err_counts.to_dict()}

Sample raw lines ({len(sample_lines)} shown):
{chr(10).join(sample_lines)}
"""

report = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {"role":"system","content": SYSTEM},
        {"role":"user","content": build_input(status_counts, top_5xx_paths, lat_pcts, err_counts, logs[:40])}
    ],
    text_format=IncidentReport
).output_parsed

report
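
When logs cover more than one day, a single prompt mixes unrelated periods and grows with the input. Partitioning lines by calendar day keeps each batch small and lets the stats and report steps run once per day. A minimal sketch, assuming every line carries a bracketed `dd/Mon/yyyy:HH:MM:SS` timestamp like the synthetic data. `partition_by_day` is a hypothetical helper.

```python
import re
from collections import defaultdict
from datetime import datetime

# Captures the date part of "[09/Jan/2026:12:00:00 -0500]".
DATE_RE = re.compile(r'\[(\d{2}/[A-Za-z]{3}/\d{4}):')

def partition_by_day(lines):
    """Group log lines by ISO date; lines without a timestamp go under 'unknown'."""
    batches = defaultdict(list)
    for line in lines:
        m = DATE_RE.search(line)
        if m:
            # %b assumes English month abbreviations (the default C locale).
            day = datetime.strptime(m.group(1), "%d/%b/%Y").date().isoformat()
        else:
            day = "unknown"
        batches[day].append(line)
    return dict(batches)

# Illustrative lines straddling midnight.
demo = [
    '10.2.3.4 - - [09/Jan/2026:23:59:58 -0500] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0" 90ms',
    '[09/Jan/2026:23:59:59 -0500] api ERROR: token validation failed',
    '10.2.3.4 - - [10/Jan/2026:00:00:01 -0500] "GET /login HTTP/1.1" 500 1234 "-" "Mozilla/5.0" 400ms',
]
for day, batch in sorted(partition_by_day(demo).items()):
    print(day, len(batch))
```

Each batch can then go through the parsing, stats, and `build_input` steps to produce one report per day. The grouping uses the date as written in the log and does not normalise UTC offsets.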

## 7. Turn the report into a ticket-ready payload

Often you’ll want a concise block for a Jira/ServiceNow update.


In [None]:
ticket_update = f"""INCIDENT SUMMARY
Severity: {report.severity}
Summary: {report.summary}

Key metrics:
- """ + "\n- ".join([f"{k}: {v}" for k,v in report.key_metrics.items()]) + f"""

Top errors:
- """ + "\n- ".join(report.top_errors[:5]) + f"""

Recommended actions:
- """ + "\n- ".join(report.recommended_actions[:5]) + """
"""

print(ticket_update)
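
One edge case in the string concatenation above: if the model returns an empty list, `"\n- ".join([])` is an empty string and the section renders as a lone `- `. A small helper can make the empty case explicit. This is a sketch; `bullets` and the `(none reported)` placeholder are hypothetical choices.

```python
def bullets(items, limit=5, empty="(none reported)"):
    """Render up to `limit` items as a '- ' bullet list, with a placeholder when empty."""
    shown = list(items)[:limit]
    if not shown:
        return f"- {empty}"
    return "\n".join(f"- {item}" for item in shown)

print("Top errors:")
print(bullets(["upstream timeout while reading response header",
               "database connection pool exhausted"]))
print("Recommended actions:")
print(bullets([]))
```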

## 8. Exercises

These extend the notebook into production-ish patterns.


In [None]:
# EXERCISE
# Add a redaction step that replaces email addresses with '<EMAIL>' and long numeric IDs (8+ digits) with '<ID>' before sampling raw lines.

def redact(line: str) -> str:
    # TODO: implement redaction
    raise NotImplementedError("TODO")

# Apply redact() to logs and rebuild the sample_lines
raise NotImplementedError("TODO")


In [None]:
# EXERCISE
# Compute an 'anomaly score' for each endpoint: (5xx_rate * median_latency). Print top 5 endpoints by score.

# TODO: compute per-endpoint 5xx rate and median latency, then score
raise NotImplementedError("TODO")


In [None]:
# EXERCISE
# Modify the model prompt to include the anomaly scores and ask it to reference the highest-scoring endpoint in the report.

# TODO: add anomaly score summary into build_input and re-run report generation
raise NotImplementedError("TODO")
