# Log Analysis & Anomaly Summarization

**Goal:** Turn a day of web/app logs into a structured incident summary (counts, anomalies, and next steps).

**No GPU required.**

You’ll learn:
- Simple log parsing + redaction
- Long-context summarization patterns
- Structured outputs for machine-readable incident reports



## 1. Setup and Installation

You’ll need `openai`, `pydantic`, and `pandas`.


In [None]:
!pip -q install --upgrade openai pydantic pandas

## 2. Imports + API client

In [None]:
import os
import re
from collections import Counter, defaultdict
from typing import List, Dict

import pandas as pd
from pydantic import BaseModel, Field, confloat

from openai import OpenAI

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("Missing OPENAI_API_KEY. Set it and re-run.")

client = OpenAI()


## 3. Synthetic logs (Nginx + app errors)

We’ll use small synthetic logs so this notebook runs anywhere.
In real usage, you’d load logs from files, S3, or a logging platform.
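
For file-based logs, a loader might look like the minimal sketch below. It assumes plain-text or gzip-compressed files with one entry per line; the function name `read_log_lines` and the path in the usage comment are hypothetical and not part of this notebook.

```python
import gzip
from pathlib import Path
from typing import List

def read_log_lines(path) -> List[str]:
    """Read a plain or .gz log file into a list of non-empty lines."""
    path = Path(path)
    # Pick the opener from the file extension; both accept text mode ("rt").
    opener = gzip.open if path.suffix == ".gz" else open
    # errors="replace" keeps one badly encoded byte from aborting the whole read.
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\r\n") for line in fh if line.strip()]

# Hypothetical usage:
# access_logs = read_log_lines("logs/access.log.gz")
```

The returned list has the same shape as `access_logs` and `app_logs` below, so the rest of the notebook would work unchanged.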


In [None]:
import random
random.seed(7)

def synth_access_log(n=200):
    endpoints = ["/search", "/locations", "/events", "/api/catalog", "/account/login"]
    statuses = [200, 200, 200, 200, 404, 500, 502]
    agents = ["Mozilla/5.0", "curl/8.0", "NYPLMobile/2.1"]
    rows=[]
    for i in range(n):
        ep = random.choice(endpoints)
        status = random.choice(statuses)
        ms = int(abs(random.gauss(120, 60)))
        if ep == "/search" and random.random() < 0.25:
            status = random.choice([500,502])
            ms = int(abs(random.gauss(900, 250)))
        ip = f"192.0.2.{random.randint(1,254)}"  # RFC 5737 TEST-NET-1
        ua = random.choice(agents)
        rows.append(f'{ip} - - [09/Jan/2026:14:{random.randint(0,59):02d}:{random.randint(0,59):02d} -0500] "GET {ep} HTTP/1.1" {status} 1234 "-" "{ua}" {ms}')
    return rows

def synth_app_log(n=60):
    msgs = [
        "DB timeout in queryCatalog() after 5s",
        "Redis connection refused",
        "Upstream read timeout",
        "JWT validation failed: token expired",
        "Rate limit exceeded for /api/catalog",
        "Template render error in due-date notice",
    ]
    rows=[]
    for i in range(n):
        level = random.choices(["INFO","WARN","ERROR"], weights=[0.55,0.25,0.20])[0]
        msg = random.choice(msgs)
        if "timeout" in msg and random.random() < 0.5:
            level = "ERROR"
        rows.append(f"2026-01-09T14:{random.randint(0,59):02d}:{random.randint(0,59):02d}-05:00 {level} {msg}")
    return rows

access_logs = synth_access_log()
app_logs = synth_app_log()

print(access_logs[0])
print(app_logs[0])


## 4. Redaction helpers

Always remove secrets and tokens before sending logs to a model.
Here we demonstrate simple regex-based redaction for bearer tokens, API keys, and JWTs.


In [None]:
SECRET_PATTERNS = [
    (re.compile(r"(?i)authorization:\s*bearer\s+[A-Za-z0-9._-]+"), "authorization: Bearer [REDACTED]"),
    (re.compile(r"(?i)api[_-]?key\s*[:=]\s*[A-Za-z0-9._-]+"), "api_key=[REDACTED]"),
    (re.compile(r"(?i)jwt\s+[A-Za-z0-9._-]+\.[A-Za-z0-9._-]+\.[A-Za-z0-9._-]+"), "JWT [REDACTED]"),
]

def redact(text: str) -> str:
    out = text
    for pat, repl in SECRET_PATTERNS:
        out = pat.sub(repl, out)
    return out

# Example
redact("authorization: Bearer abc.def.ghi")
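
Access logs can also carry secrets inside request URLs. The sketch below masks the values of sensitive query-string parameters, in the same spirit as `SECRET_PATTERNS`. The parameter names in `SENSITIVE_PARAMS` and the helper `redact_query_secrets` are assumptions for illustration; adjust them to whatever your URLs actually contain.

```python
import re

# Hypothetical parameter names; extend to match the secrets your URLs can carry.
SENSITIVE_PARAMS = ("token", "access_token", "password", "session")

# Capture "?name=" or "&name=" and replace only the value that follows it.
QUERY_SECRET_RE = re.compile(
    r"(?i)([?&](?:" + "|".join(SENSITIVE_PARAMS) + r')=)[^&\s"]+'
)

def redact_query_secrets(text: str) -> str:
    """Mask the values of sensitive query-string parameters."""
    return QUERY_SECRET_RE.sub(r"\g<1>[REDACTED]", text)

line = '"GET /search?q=maps&token=abc123 HTTP/1.1" 200'
print(redact_query_secrets(line))
```

Keeping the parameter name and masking only the value preserves the shape of the request, which is usually what matters when debugging.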


## 5. Quick parsing (counts by status + endpoint)

This is a cheap, deterministic baseline. We’ll feed these stats + samples to the model.


In [None]:
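# Matches the synthetic format only: GET requests over HTTP/1.1, with the
# latency in milliseconds as the last field on the line.
ACCESS_RE = re.compile(r'"GET (?P<path>\S+) HTTP/1\.1" (?P<status>\d{3}) .* (?P<ms>\d+)$')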

def parse_access(log_lines: List[str]) -> pd.DataFrame:
    rows=[]
    for line in log_lines:
        m = ACCESS_RE.search(line)
        if not m:
            continue
        rows.append({
            "path": m.group("path"),
            "status": int(m.group("status")),
            "latency_ms": int(m.group("ms")),
        })
    return pd.DataFrame(rows)

df_access = parse_access(access_logs)
df_access.head()
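
The app logs can get the same cheap, deterministic treatment. The sketch below assumes the `timestamp LEVEL message` layout produced by `synth_app_log`; the names `APP_RE` and `count_app_levels` are hypothetical helpers, not part of the notebook.

```python
import re
from collections import Counter

# One timestamp token, a known level, then the free-text message.
APP_RE = re.compile(r"^(?P<ts>\S+) (?P<level>INFO|WARN|ERROR) (?P<msg>.*)$")

def count_app_levels(lines):
    """Return (entries per level, ERROR messages by frequency)."""
    levels, error_msgs = Counter(), Counter()
    for line in lines:
        m = APP_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the expected layout
        levels[m.group("level")] += 1
        if m.group("level") == "ERROR":
            error_msgs[m.group("msg")] += 1
    return levels, error_msgs

# Usage with the notebook's data:
# levels, error_msgs = count_app_levels(app_logs)
# error_msgs.most_common(3)
```

Counts like these could be appended to the aggregated metrics so the model sees app-level errors as numbers, not only as raw samples.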


In [None]:
status_counts = df_access["status"].value_counts().to_dict()
endpoint_counts = df_access["path"].value_counts().to_dict()

slow = df_access.sort_values("latency_ms", ascending=False).head(10)

status_counts, list(endpoint_counts.items())[:3], slow.head(3)
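
Raw counts hide which endpoint is failing. As a hedged sketch, the helper below computes a 5xx error rate per endpoint from rows shaped like those built in `parse_access` (for example `df_access.to_dict(orient="records")`). The name `endpoint_error_rates` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def endpoint_error_rates(rows):
    """Per-path request count, 5xx count, and 5xx rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in rows:
        totals[row["path"]] += 1
        if 500 <= row["status"] <= 599:
            errors[row["path"]] += 1
    return {
        path: {
            "requests": totals[path],
            "errors_5xx": errors[path],
            "error_rate_5xx": round(errors[path] / totals[path], 3),
        }
        for path in totals
    }

# Illustrative rows, not real measurements.
rows = [
    {"path": "/search", "status": 200, "latency_ms": 110},
    {"path": "/search", "status": 500, "latency_ms": 950},
    {"path": "/search", "status": 502, "latency_ms": 870},
    {"path": "/events", "status": 404, "latency_ms": 95},
]
rates = endpoint_error_rates(rows)
print(rates["/search"])
```

A rate per endpoint gives the model a direct signal for the `top_endpoints` notes, so it does not have to infer hotspots from the samples.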


## 6. Define an incident summary schema

We want machine-readable output that could power a dashboard, Jira ticket, or Slack post.
Structured Outputs constrain the model's response to this schema, so the result parses directly into the Pydantic models below.

In [None]:
class EndpointHotspot(BaseModel):
    path: str
    count: int
    notes: str

class Anomaly(BaseModel):
    title: str
    evidence: List[str] = Field(..., description="Short evidence strings based on provided logs/stats.")
    likely_cause: str
    suggested_actions: List[str]

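class KeyMetrics(BaseModel):
    # Fixed fields: strict Structured Outputs do not accept open-ended Dict[str, int] maps.
    total_requests: int
    errors_5xx: int
    errors_4xx: int

class LogIncidentSummary(BaseModel):
    overview: str
    key_metrics: KeyMetrics = Field(..., description="Core counts derived from the provided stats.")
    top_endpoints: List[EndpointHotspot]
    anomalies: List[Anomaly]
    confidence: float = Field(..., ge=0, le=1)
    followup_questions: List[str] = Field(default_factory=list)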


## 7. Long-context prompt (stats + samples)

We’ll send:
- Aggregated stats (fast + cheap)
- A small sample of raw log lines (concrete evidence the model can quote)

Tip: For very large logs, summarize per chunk and then summarize summaries.
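
`build_payload` below takes the first `max_samples` lines of each log, which can miss the errors entirely if they occur later in the day. One alternative, sketched here under the assumption that lines follow the synthetic access-log format, is to fill the sample with 5xx lines first and top it up with other lines. `STATUS_RE` and `sample_for_evidence` are hypothetical names.

```python
import random
import re

# The status code is the 3-digit field right after the quoted request.
STATUS_RE = re.compile(r'" (\d{3}) ')

def sample_for_evidence(lines, max_samples=40, seed=0):
    """Pick up to max_samples lines, preferring 5xx responses."""
    rng = random.Random(seed)  # seeded so the payload is reproducible
    error_idx, other_idx = [], []
    for i, line in enumerate(lines):
        m = STATUS_RE.search(line)
        is_5xx = bool(m) and m.group(1).startswith("5")
        (error_idx if is_5xx else other_idx).append(i)
    if len(error_idx) > max_samples:
        picked = rng.sample(error_idx, max_samples)
    else:
        fill = min(max_samples - len(error_idx), len(other_idx))
        picked = error_idx + rng.sample(other_idx, fill)
    # Return lines in their original order so timestamps still read naturally.
    return [lines[i] for i in sorted(picked)]

# Usage with the notebook's data:
# evidence = [redact(x) for x in sample_for_evidence(access_logs)]
```

Biasing the sample toward errors is a deliberate trade-off: the samples no longer reflect the overall traffic mix, so the aggregated metrics remain the source of truth for counts.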


In [None]:
SYSTEM = """You are an SRE-style assistant for a library web platform.
Write an incident-style summary based ONLY on the provided stats and log samples.

Rules:
- Do not invent metrics that aren't provided.
- If something is unclear, ask follow-up questions.
- Use the schema exactly.
"""


def build_payload(access_logs: List[str], app_logs: List[str], max_samples: int = 40) -> str:
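    # Redact, then keep only the first max_samples lines of each log (a head slice, not a random sample).
    # The aggregated metrics come from the notebook-level variables computed in section 5.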
    a = [redact(x) for x in access_logs[:max_samples]]
    b = [redact(x) for x in app_logs[:max_samples]]
    return (
        "AGGREGATED METRICS\n"
        f"status_counts={status_counts}\n"
        f"endpoint_counts={dict(list(endpoint_counts.items())[:10])}\n"
        f"top_slowest={slow.to_dict(orient='records')}\n\n"
        "ACCESS LOG SAMPLES\n" + "\n".join(a) + "\n\n"
        "APP LOG SAMPLES\n" + "\n".join(b)
    )

payload = build_payload(access_logs, app_logs)
payload[:600]


In [None]:
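def summarize_logs(payload: str, model: str = "gpt-4o-mini") -> LogIncidentSummary:
    # Request output that parses directly into the LogIncidentSummary schema.
    response = client.responses.parse(
        model=model,
        input=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": payload},
        ],
        text_format=LogIncidentSummary,
    )
    # output_parsed is None when no parseable output came back (for example, a refusal).
    if response.output_parsed is None:
        raise ValueError("Model returned no parsed summary.")
    return response.output_parsed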

summary = summarize_logs(payload)
summary
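
To turn the structured result into something postable, one option is to render it as plain text. The sketch below works on a plain dict such as `summary.model_dump()` would produce; `render_incident_post` is a hypothetical helper, and the example dict holds placeholder values, not real model output.

```python
def render_incident_post(summary: dict) -> str:
    """Format an incident summary dict as a short plain-text post."""
    lines = [
        f"*Incident summary* (confidence {summary['confidence']:.0%})",
        summary["overview"],
        "Key metrics: "
        + ", ".join(f"{k}={v}" for k, v in summary["key_metrics"].items()),
    ]
    for anomaly in summary["anomalies"]:
        lines.append(f"- {anomaly['title']}: {anomaly['likely_cause']}")
        for action in anomaly["suggested_actions"]:
            lines.append(f"    next step: {action}")
    if summary.get("followup_questions"):
        lines.append("Open questions: " + " | ".join(summary["followup_questions"]))
    return "\n".join(lines)

# Placeholder values for illustration only.
example = {
    "overview": "Elevated 5xx responses on /search.",
    "key_metrics": {"total_requests": 200, "errors_5xx": 40},
    "anomalies": [{
        "title": "Search errors",
        "likely_cause": "Upstream timeouts",
        "suggested_actions": ["Check upstream latency"],
    }],
    "confidence": 0.7,
    "followup_questions": ["Was there a deploy around 14:00?"],
}
print(render_incident_post(example))
```

Because the function only reads dict keys, it stays decoupled from the Pydantic classes and is easy to adapt to a Jira description or a dashboard payload.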


## 8. Exercises

### EXERCISE 1: Add a redaction for email addresses
Write a regex that redacts `name@example.com`.

### EXERCISE 2: Add chunking for long logs
Implement a function that:
- Splits logs into chunks of N lines
- Summarizes each chunk
- Produces a final summary over chunk summaries

### EXERCISE 3: Add a 'severity' field
Extend the schema with `severity` (low/medium/high) based on 5xx rate and latency.


In [None]:
# EXERCISE STARTER CELL

pass
