# Lab 00e: Visualization & Statistics for Security

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab00e_visualization_stats.ipynb)

Master interactive data visualization and statistical analysis for security data.

## Learning Objectives
- Calculate baseline statistics for security metrics
- Create interactive visualizations with Plotly
- Build multi-panel security dashboards
- Analyze time series and distributions
- Apply statistical methods for anomaly context

## Why This Matters

| Challenge | Visualization Solution |
|-----------|------------------------|
| Too much data | Aggregation & filtering |
| Hidden patterns | Time series & correlation |
| Outlier detection | Distribution plots & z-scores |
| Stakeholder reporting | Interactive dashboards |

**No API keys required!**

In [None]:
# Install dependencies (uncomment for Colab)
# !pip install plotly pandas numpy scipy

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
from datetime import datetime, timedelta
import random

# Plotly template for consistent styling
PLOTLY_TEMPLATE = "plotly_white"

# Security-focused color scheme
COLORS = {
    "primary": "#2E86AB",
    "secondary": "#A23B72",
    "success": "#2ECC71",
    "warning": "#F39C12",
    "danger": "#E74C3C",
    "info": "#3498DB",
}

import plotly  # the top-level package exposes __version__

print("✅ Libraries imported successfully!")
print(f"📊 Plotly version: {plotly.__version__}")

## 1. Generate Sample Security Data

We'll create realistic security event data including:
- Authentication logs with success/failure
- Network traffic with anomalies
- Threat scores for IOCs

In [None]:
# Generate security event data
np.random.seed(42)
random.seed(42)

# Authentication events
n_events = 200
event_types = ["login", "logout", "file_access", "privilege_escalation", "network_scan"]
severities = ["info", "warning", "critical"]
users = ["alice", "bob", "charlie", "admin", "service_account", "unknown"]
ips = [f"192.168.1.{i}" for i in range(1, 20)] + ["10.0.0.50", "10.0.0.51", "203.0.113.5"]

events = []
base_time = datetime(2024, 1, 15, 0, 0, 0)

for i in range(n_events):
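    # Create a realistic attack pattern: more failures at certain times.
    # The hour is taken from the event's own timestamp so the attack window
    # lines up with the hour-of-day charts later in the lab.
    hour = (base_time + timedelta(minutes=i * 7)).hour
    is_attack_window = 8 <= hour <= 10  # Attack window: 08:00-10:59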

    if is_attack_window and random.random() < 0.4:
        event_type = random.choice(["login", "privilege_escalation", "network_scan"])
        severity = random.choice(["warning", "critical"])
        success = random.random() < 0.2  # Mostly failures during attack
        source_ip = random.choice(["10.0.0.50", "10.0.0.51", "203.0.113.5"])
        user = random.choice(["admin", "unknown"])
        response_ms = random.randint(100, 500)  # Slow during attack
    else:
        event_type = random.choice(event_types)
        severity = "info" if random.random() < 0.8 else "warning"
        success = random.random() < 0.95  # Normal success rate
        source_ip = random.choice(ips[:18])  # Internal IPs
        user = random.choice(users[:4])
        response_ms = random.randint(10, 80)

    events.append({
        "timestamp": base_time + timedelta(minutes=i*7),
        "event_type": event_type,
        "severity": severity,
        "source_ip": source_ip,
        "user": user,
        "success": success,
        "response_ms": response_ms,
    })

events_df = pd.DataFrame(events)

print(f"📋 Generated {len(events_df)} security events")
print(f"\nEvent types: {events_df['event_type'].value_counts().to_dict()}")
print(f"Severity breakdown: {events_df['severity'].value_counts().to_dict()}")
events_df.head()

In [None]:
# Generate hourly traffic data with anomaly
traffic_data = []
for hour in range(24):
    # Normal traffic pattern (business hours peak)
    if 6 <= hour <= 18:
        base_requests = 400 + 300 * np.sin((hour - 6) * np.pi / 12)
    else:
        base_requests = 100 + 50 * np.random.random()

    # Inject anomaly at hour 14 (DDoS simulation)
    if hour == 14:
        base_requests = 3500  # Massive spike

    requests = int(base_requests + np.random.normal(0, 30))
    bytes_in = requests * random.randint(350, 450)
    bytes_out = requests * random.randint(80, 120)
    errors = max(0, int(requests * 0.01 + np.random.normal(0, 2)))

    # More errors during anomaly
    if hour == 14:
        errors = 150

    traffic_data.append({
        "hour": hour,
        "requests": requests,
        "bytes_in": bytes_in,
        "bytes_out": bytes_out,
        "errors": errors,
    })

traffic_df = pd.DataFrame(traffic_data)

print("📊 Traffic Statistics:")
print(traffic_df.describe().round(2))

In [None]:
# Generate threat scores (mix of benign and malicious)
n_scores = 100

# Most are low threat (benign)
benign_scores = np.random.beta(2, 8, int(n_scores * 0.7)) * 0.4
# Some medium threat
medium_scores = np.random.uniform(0.3, 0.7, int(n_scores * 0.15))
# Few high threat (malicious)
malicious_scores = np.random.beta(8, 2, int(n_scores * 0.15)) * 0.3 + 0.7

threat_scores = np.concatenate([benign_scores, medium_scores, malicious_scores])
np.random.shuffle(threat_scores)

print(f"🎯 Generated {len(threat_scores)} threat scores")
print(f"   Low threat (< 0.3): {(threat_scores < 0.3).sum()}")
print(f"   Medium threat (0.3-0.7): {((threat_scores >= 0.3) & (threat_scores < 0.7)).sum()}")
print(f"   High threat (>= 0.7): {(threat_scores >= 0.7).sum()}")

## 2. Statistical Analysis for Security

Key statistics help establish baselines and detect anomalies:

| Statistic | Formula | Security Use |
|-----------|---------|---------------|
| Mean | Σx/n | Average baseline |
| Median | Middle value | Robust to outliers |
| Std Dev | √(Σ(x-μ)²/n) | Variability measure |
| Z-Score | (x-μ)/σ | Anomaly magnitude |
| Percentile | P95, P99 | SLA thresholds |
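
One caveat on the z-score row: a large outlier inflates the very mean and standard deviation it is measured against. The sketch below works through this on a small, made-up list of hourly counts using only the standard library, and compares the result with a median/MAD-based "modified z-score". That is a swapped-in robust alternative, not something the lab's own code uses; the sample values and the `modified_z` helper are assumptions for illustration.

```python
import statistics

# Hypothetical hourly request counts; the last value is an injected spike.
counts = [100, 110, 95, 105, 90, 100, 1000]

mean = statistics.fmean(counts)
median = statistics.median(counts)
std = statistics.pstdev(counts)  # population std (divide by n), as in the table

# Classic z-score: the spike drags the mean up and inflates the std.
spike_z = (counts[-1] - mean) / std

# Median absolute deviation (MAD): a spread measure that resists the spike.
mad = statistics.median([abs(x - median) for x in counts])


def modified_z(x):
    """Illustrative modified z-score based on the median and MAD."""
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data are normally distributed.
    return 0.6745 * (x - median) / mad


print(f"mean={mean:.1f}  median={median}  std={std:.1f}")
print(f"z-score of spike:          {spike_z:.2f}")
print(f"modified z-score of spike: {modified_z(counts[-1]):.2f}")
```

With only seven points, the classic z-score of the spike stays below 3 even though the value is ten times the typical count: with a population standard deviation, no single point among n can exceed √(n−1). On short windows, a fixed |z| > 3 rule can therefore never fire.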

In [None]:
def calculate_baseline_stats(values):
    """Calculate baseline statistics for security metrics."""
    arr = np.array(values)
    return {
        "mean": float(np.mean(arr)),
        "median": float(np.median(arr)),
        "std": float(np.std(arr)),
        "min": float(np.min(arr)),
        "max": float(np.max(arr)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }

# Analyze traffic baseline
requests = traffic_df["requests"].tolist()
baseline = calculate_baseline_stats(requests)

print("📊 Traffic Baseline Statistics")
print("=" * 40)
for key, value in baseline.items():
    print(f"  {key:>8}: {value:>10.2f}")

print(f"\n💡 Insight: Traffic above {baseline['p95']:.0f} requests/hour is unusual (top 5%)")

In [None]:
# Z-Score analysis for anomaly detection
z_scores = stats.zscore(traffic_df["requests"])
traffic_df["z_score"] = z_scores
traffic_df["is_anomaly"] = abs(z_scores) > 2

print("🔍 Z-Score Anomaly Detection")
print("=" * 40)
print("Z-score thresholds:")
print("  |z| > 2: Unusual (outside ~95% of a normal distribution)")
print("  |z| > 3: Extreme outlier (outside ~99.7% of a normal distribution)")

anomalies = traffic_df[traffic_df["is_anomaly"]]
print(f"\n⚠️  Anomalies detected: {len(anomalies)}")

for _, row in anomalies.iterrows():
    severity = "🔴 EXTREME" if abs(row["z_score"]) > 3 else "🟡 UNUSUAL"
    # iterrows() may upcast integers, so cast before using integer format specs
    print(f"  Hour {int(row['hour']):2d}: {int(row['requests']):,} requests (z={row['z_score']:.2f}) {severity}")

## 3. Distribution Visualization

Understanding data distributions helps identify:
- Normal vs abnormal patterns
- Outliers and anomalies
- Class imbalance in threat data
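
As a minimal sketch of the class-imbalance point, the snippet below bins a handful of scores into Low/Medium/High and reports each class's share. It uses only the standard library; the cut points (0.3 and 0.7), the `threat_level` helper, and the sample scores are assumptions chosen for illustration, not output from the lab's generated data.

```python
from bisect import bisect_right
from collections import Counter

# Cut points assumed for illustration: Low < 0.3 <= Medium < 0.7 <= High.
THRESHOLDS = [0.3, 0.7]
LABELS = ["Low", "Medium", "High"]


def threat_level(score):
    """Map a score in [0, 1] to a risk label."""
    return LABELS[bisect_right(THRESHOLDS, score)]


# Hypothetical scores, not taken from the lab's generated data.
sample_scores = [0.05, 0.12, 0.30, 0.45, 0.69, 0.70, 0.93, 0.10, 0.20, 0.08]

level_counts = Counter(threat_level(s) for s in sample_scores)
for label in LABELS:
    share = 100 * level_counts[label] / len(sample_scores)
    print(f"{label:<6} {level_counts[label]:>2}  ({share:.0f}%)")

# Ratio of the most common class to the rarest: a quick imbalance check.
imbalance = max(level_counts.values()) / min(level_counts.values())
print(f"Imbalance ratio: {imbalance:.1f}")
```

Stating the boundary rule explicitly (here, a score of exactly 0.3 counts as Medium) keeps the counts consistent with whatever thresholds are drawn on the histogram.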

In [None]:
# Threat score distribution with categorization
threat_df = pd.DataFrame({"score": threat_scores})
threat_df["threat_level"] = pd.cut(
    threat_df["score"],
    bins=[0, 0.3, 0.7, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Calculate statistics
mean_score = np.mean(threat_scores)
median_score = np.median(threat_scores)

fig = px.histogram(
    threat_df,
    x="score",
    color="threat_level",
    nbins=25,
    title="🎯 Threat Score Distribution with Risk Categorization",
    template=PLOTLY_TEMPLATE,
    color_discrete_map={
        "Low": COLORS["success"],
        "Medium": COLORS["warning"],
        "High": COLORS["danger"],
    },
)

# Add statistical reference lines
fig.add_vline(x=mean_score, line_dash="dash", line_color=COLORS["primary"],
              annotation_text=f"Mean: {mean_score:.2f}", annotation_position="top")
fig.add_vline(x=median_score, line_dash="dot", line_color=COLORS["secondary"],
              annotation_text=f"Median: {median_score:.2f}", annotation_position="bottom")

# Add threshold lines
fig.add_vline(x=0.3, line_dash="solid", line_color="gray", line_width=1)
fig.add_vline(x=0.7, line_dash="solid", line_color="gray", line_width=1)

fig.update_layout(
    xaxis_title="Threat Score (0-1)",
    yaxis_title="Count",
    legend_title="Risk Level",
    height=450,
    bargap=0.1,
)

fig.show()

print("\n📊 Distribution Insights:")
print(f"   Skewness: {stats.skew(threat_scores):.2f} (positive = right-skewed, many low values)")
print(f"   Kurtosis: {stats.kurtosis(threat_scores):.2f} (excess kurtosis; > 0 = heavier tails than a normal distribution)")

In [None]:
# Box plot: Response time by event type
fig = px.box(
    events_df,
    x="event_type",
    y="response_ms",
    color="severity",
    title="📦 Response Time Distribution by Event Type and Severity",
    template=PLOTLY_TEMPLATE,
    points="outliers",
    color_discrete_map={
        "info": COLORS["info"],
        "warning": COLORS["warning"],
        "critical": COLORS["danger"],
    },
)

# Add SLA threshold
fig.add_hline(y=100, line_dash="dash", line_color=COLORS["danger"],
              annotation_text="SLA: 100ms", annotation_position="right")

fig.update_layout(
    xaxis_title="Event Type",
    yaxis_title="Response Time (ms)",
    height=450,
    legend_title="Severity",
)

fig.show()

# Calculate SLA violations
sla_violations = events_df[events_df["response_ms"] > 100]
print(f"\n⚠️  SLA Violations (>100ms): {len(sla_violations)} events ({100*len(sla_violations)/len(events_df):.1f}%)")
print(f"   By severity: {sla_violations['severity'].value_counts().to_dict()}")

## 4. Time Series Visualization

Time series help identify:
- Attack timing patterns
- Baseline vs anomalous periods
- Trend analysis
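
One way to separate baseline from anomalous periods in a time series is to compare each point with a trailing window rather than with the whole series. The following is a rough sketch under stated assumptions: the `rolling_anomalies` helper, the window size, the threshold, and the sample counts are all hypothetical, and only the standard library is used.

```python
import statistics


def rolling_anomalies(series, window=5, threshold=3.0):
    """Flag points whose z-score against the previous `window` values exceeds `threshold`."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)
        if sigma == 0:
            continue  # flat history: z-score undefined, skip this point
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, series[i], round(z, 2)))
    return flagged


# Hypothetical hourly request counts with one spike at index 7.
hourly = [100, 104, 98, 101, 97, 103, 99, 480, 102, 100]

flagged = rolling_anomalies(hourly)
for index, value, z in flagged:
    print(f"index {index}: value={value}, z={z}")
```

Note that once the spike enters the trailing window it inflates the local standard deviation, so the points right after it are judged against a contaminated baseline. Excluding flagged points from later windows is a common refinement that this sketch does not implement.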

In [None]:
# Traffic timeline with anomaly highlighting
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Normal traffic
normal_traffic = traffic_df[~traffic_df["is_anomaly"]]
anomaly_traffic = traffic_df[traffic_df["is_anomaly"]]

# Requests line
fig.add_trace(
    go.Scatter(
        x=traffic_df["hour"],
        y=traffic_df["requests"],
        name="Requests",
        mode="lines+markers",
        line=dict(color=COLORS["primary"], width=2),
        marker=dict(size=8),
        hovertemplate="Hour %{x}<br>Requests: %{y:,}<extra></extra>",
    ),
    secondary_y=False,
)

# Anomaly markers
if not anomaly_traffic.empty:
    fig.add_trace(
        go.Scatter(
            x=anomaly_traffic["hour"],
            y=anomaly_traffic["requests"],
            name="⚠️ Anomaly",
            mode="markers",
            marker=dict(color=COLORS["danger"], size=18, symbol="x", line=dict(width=2)),
            customdata=anomaly_traffic["z_score"].values,
            hovertemplate="⚠️ ANOMALY<br>Hour %{x}<br>Requests: %{y:,}<br>Z-score: %{customdata:.2f}<extra></extra>",
        ),
        secondary_y=False,
    )

# Error rate
fig.add_trace(
    go.Scatter(
        x=traffic_df["hour"],
        y=traffic_df["errors"],
        name="Errors",
        mode="lines+markers",
        line=dict(color=COLORS["warning"], width=2, dash="dot"),
        marker=dict(size=6),
        hovertemplate="Hour %{x}<br>Errors: %{y}<extra></extra>",
    ),
    secondary_y=True,
)

# Add baseline reference
fig.add_hline(y=baseline["mean"], line_dash="dash", line_color="gray",
              annotation_text=f"Baseline: {baseline['mean']:.0f}", annotation_position="right",
              secondary_y=False)

fig.update_layout(
    title="📈 Network Traffic Over 24 Hours (with Anomaly Detection)",
    template=PLOTLY_TEMPLATE,
    height=500,
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01),
    hovermode="x unified",
)

fig.update_xaxes(title_text="Hour of Day", dtick=2)
fig.update_yaxes(title_text="Request Count", secondary_y=False)
fig.update_yaxes(title_text="Error Count", secondary_y=True)

fig.show()

In [None]:
# Event timeline by severity
events_df["hour"] = events_df["timestamp"].dt.hour
hourly_severity = events_df.groupby(["hour", "severity"]).size().unstack(fill_value=0)

fig = go.Figure()

for severity in ["info", "warning", "critical"]:
    if severity in hourly_severity.columns:
        fig.add_trace(
            go.Bar(
                x=hourly_severity.index,
                y=hourly_severity[severity],
                name=severity.capitalize(),
                marker_color={
                    "info": COLORS["info"],
                    "warning": COLORS["warning"],
                    "critical": COLORS["danger"],
                }[severity],
            )
        )

fig.update_layout(
    title="⏰ Security Events by Hour and Severity (Stacked)",
    template=PLOTLY_TEMPLATE,
    barmode="stack",
    xaxis_title="Hour of Day",
    yaxis_title="Event Count",
    height=400,
    legend_title="Severity",
)

fig.show()

# Find attack window
critical_by_hour = events_df[events_df["severity"] == "critical"].groupby("hour").size()
if not critical_by_hour.empty:
    peak_hour = critical_by_hour.idxmax()
    print(f"\n🚨 Attack Window Detected: Hour {peak_hour} has highest critical events ({critical_by_hour.max()})")

## 5. Correlation Analysis

Correlation helps identify:
- Related security metrics
- Feature selection for ML
- Attack indicators
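
Before reading a correlation as an attack indicator, it helps to see how strongly one extreme point can drive Pearson's r. The sketch below computes r by hand on made-up data, first for quiet traffic and then with a single incident-like point appended. The `pearson` helper and all sample values are assumptions for illustration; only the standard library is used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical quiet-period data: errors show no clear link to request volume.
req_sample = [100, 200, 300, 400, 500]
err_sample = [3, 1, 4, 1, 3]
r_quiet = pearson(req_sample, err_sample)

# Append one incident-like point (very high requests and errors together).
r_spike = pearson(req_sample + [3500], err_sample + [150])

print(f"r without spike: {r_quiet:.2f}")
print(f"r with spike:    {r_spike:.2f}")
```

A near-perfect correlation that disappears when one row is removed describes that one incident more than a general relationship, so it is worth recomputing the matrix with flagged anomalies excluded.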

In [None]:
# Improved Correlation Heatmap - easier to interpret
corr_cols = ["requests", "bytes_in", "bytes_out", "errors"]
corr_matrix = traffic_df[corr_cols].corr()

# Human-readable labels
label_map = {
    "requests": "Requests",
    "bytes_in": "Bytes In",
    "bytes_out": "Bytes Out",
    "errors": "Errors"
}
display_labels = [label_map[c] for c in corr_cols]

# Mask upper triangle (correlation matrix is symmetric)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
masked_corr = corr_matrix.copy()
masked_corr = masked_corr.where(~mask)

# Create annotation text with significance markers
# *** = very strong (>0.9), ** = strong (>0.7), * = moderate (>0.5)
def annotate_corr(val):
    if pd.isna(val):
        return ""
    abs_val = abs(val)
    stars = "***" if abs_val > 0.9 else "**" if abs_val > 0.7 else "*" if abs_val > 0.5 else ""
    return f"{val:.2f}{stars}"

annotations = [[annotate_corr(masked_corr.iloc[i, j]) for j in range(len(corr_cols))]
               for i in range(len(corr_cols))]

fig = go.Figure(
    data=go.Heatmap(
        z=masked_corr.values,
        x=display_labels,
        y=display_labels,
        colorscale="RdBu_r",
        zmid=0,
        zmin=-1,
        zmax=1,
        text=annotations,
        texttemplate="%{text}",
        textfont={"size": 16, "color": "black"},
        hovertemplate="<b>%{y}</b> vs <b>%{x}</b><br>Correlation: %{z:.3f}<extra></extra>",
        colorbar=dict(
            # title.side replaces the older titleside attribute, which newer Plotly releases reject
            title=dict(text="Correlation", side="right"),
            tickvals=[-1, -0.5, 0, 0.5, 1],
            ticktext=["−1 (inverse)", "−0.5", "0 (none)", "+0.5", "+1 (strong)"],
        ),
    )
)

fig.update_layout(
    title=dict(
        text="🔥 Feature Correlation Matrix<br><sup>Stars indicate strength: *** >0.9, ** >0.7, * >0.5</sup>",
        font=dict(size=16),
    ),
    template=PLOTLY_TEMPLATE,
    height=500,
    width=600,
    xaxis=dict(side="bottom", tickangle=0),
    yaxis=dict(autorange="reversed"),  # Match matrix convention
)

fig.show()

# Interpretation guide
# With the reversed scale "RdBu_r", high values map to red and low values to blue
print("\n📊 How to Read This Matrix:")
print("   • Diagonal = 1.00 (variable correlates perfectly with itself)")
print("   • Upper triangle hidden (matrix is symmetric)")
print("   • Red = positive correlation (both increase together)")
print("   • Blue = negative correlation (one increases, other decreases)")
print("   • Stars = strength: *** very strong, ** strong, * moderate")

# Key findings
print("\n🔍 Key Correlations Found:")
for i, col1 in enumerate(corr_cols):
    for j, col2 in enumerate(corr_cols):
        if i > j:  # Lower triangle only
            corr = corr_matrix.loc[col1, col2]
            if abs(corr) > 0.5:
                strength = "Very strong" if abs(corr) > 0.9 else "Strong" if abs(corr) > 0.7 else "Moderate"
                direction = "↑↑" if corr > 0 else "↑↓"
                meaning = "increase together" if corr > 0 else "inverse relationship"
                print(f"   {label_map[col1]} ↔ {label_map[col2]}: {strength} ({corr:.2f}) {direction}")
                print(f"      → {meaning}")

In [None]:
# Scatter plot with correlation - bytes analysis
traffic_df["error_rate"] = traffic_df["errors"] / traffic_df["requests"] * 100

fig = px.scatter(
    traffic_df,
    x="requests",
    y="errors",
    size="bytes_in",
    color="is_anomaly",
    title="🔗 Request vs Error Analysis (size = bytes, color = anomaly)",
    template=PLOTLY_TEMPLATE,
    color_discrete_map={True: COLORS["danger"], False: COLORS["primary"]},
    hover_data=["hour", "error_rate"],
)

# Add trend line for normal traffic
normal = traffic_df[~traffic_df["is_anomaly"]]
z = np.polyfit(normal["requests"], normal["errors"], 1)
p = np.poly1d(z)
x_line = np.linspace(normal["requests"].min(), normal["requests"].max(), 100)

fig.add_trace(
    go.Scatter(
        x=x_line,
        y=p(x_line),
        mode="lines",
        name="Expected Errors (linear fit)",
        line=dict(dash="dash", color="gray"),
    )
)

fig.update_layout(
    xaxis_title="Requests",
    yaxis_title="Errors",
    height=450,
    legend_title="Is Anomaly",
)

fig.show()

## 6. Security Dashboard

Combine multiple visualizations into a comprehensive SOC dashboard.

In [None]:
# Create comprehensive security dashboard
fig = make_subplots(
    rows=2,
    cols=3,
    subplot_titles=[
        "📈 Traffic Timeline",
        "🎯 Threat Score Distribution",
        "⚠️ Events by Severity",
        "🌐 Top Source IPs",
        "📊 Event Types",
        "⏱️ Response Time",
    ],
    specs=[
        [{}, {}, {}],
        [{}, {}, {}],
    ],
    vertical_spacing=0.15,
    horizontal_spacing=0.08,
)

# 1. Traffic timeline
fig.add_trace(
    go.Scatter(
        x=traffic_df["hour"],
        y=traffic_df["requests"],
        mode="lines+markers",
        name="Requests",
        line=dict(color=COLORS["primary"]),
        showlegend=False,
    ),
    row=1, col=1,
)

# 2. Threat score histogram
fig.add_trace(
    go.Histogram(
        x=threat_scores,
        nbinsx=20,
        marker_color=COLORS["warning"],
        showlegend=False,
    ),
    row=1, col=2,
)

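# 3. Severity bar chart (colors keyed by severity label, not by position)
severity_counts = events_df["severity"].value_counts()
severity_colors = {"info": COLORS["info"], "warning": COLORS["warning"], "critical": COLORS["danger"]}
fig.add_trace(
    go.Bar(
        x=severity_counts.index,
        y=severity_counts.values,
        marker_color=[severity_colors[s] for s in severity_counts.index],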
        showlegend=False,
    ),
    row=1, col=3,
)

# 4. Top source IPs
ip_counts = events_df["source_ip"].value_counts().head(8)
fig.add_trace(
    go.Bar(
        x=ip_counts.values,
        y=ip_counts.index,
        orientation="h",
        marker_color=COLORS["secondary"],
        showlegend=False,
    ),
    row=2, col=1,
)

# 5. Event type distribution
event_counts = events_df["event_type"].value_counts()
fig.add_trace(
    go.Bar(
        x=event_counts.index,
        y=event_counts.values,
        marker_color=COLORS["info"],
        showlegend=False,
    ),
    row=2, col=2,
)

# 6. Response time box
fig.add_trace(
    go.Box(
        y=events_df["response_ms"],
        marker_color=COLORS["primary"],
        showlegend=False,
    ),
    row=2, col=3,
)

# Update layout
fig.update_layout(
    title=dict(
        text="🔒 Security Operations Center Dashboard",
        font=dict(size=22),
    ),
    template=PLOTLY_TEMPLATE,
    height=650,
    width=1100,
    showlegend=False,
)

fig.show()

print("\n📋 Dashboard Summary:")
print(f"   Total Events: {len(events_df)}")
print(f"   Critical Events: {(events_df['severity'] == 'critical').sum()}")
print(f"   Traffic Anomalies: {traffic_df['is_anomaly'].sum()}")
print(f"   High-Risk Scores: {(threat_scores >= 0.7).sum()}")

## 7. Advanced: Attack Timeline Reconstruction

Use visualization to reconstruct attack progression.
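
Reconstruction usually starts by grouping events per source and putting them in time order, so each actor's sequence of actions can be read as a chain. A minimal sketch of that step, assuming a hypothetical `build_chains` helper and made-up events with the same `timestamp`, `source_ip`, and `event_type` fields used in this lab:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events, deliberately listed out of order.
sample_events = [
    {"timestamp": datetime(2024, 1, 15, 8, 40), "source_ip": "203.0.113.5",
     "event_type": "privilege_escalation"},
    {"timestamp": datetime(2024, 1, 15, 8, 5), "source_ip": "203.0.113.5",
     "event_type": "network_scan"},
    {"timestamp": datetime(2024, 1, 15, 8, 20), "source_ip": "10.0.0.50",
     "event_type": "login"},
    {"timestamp": datetime(2024, 1, 15, 8, 15), "source_ip": "203.0.113.5",
     "event_type": "login"},
]


def build_chains(events):
    """Group events by source IP, with each group ordered by timestamp."""
    chains = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        chains[event["source_ip"]].append(
            (event["timestamp"].strftime("%H:%M"), event["event_type"])
        )
    return dict(chains)


chains = build_chains(sample_events)
for ip, steps in chains.items():
    print(ip, " -> ".join(f"{etype} ({ts})" for ts, etype in steps))
```

Grouping by source IP is only one possible key; grouping by user or by host may fit better when an attacker rotates addresses.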

In [None]:
# Attack timeline with event progression
events_df["severity_num"] = events_df["severity"].map({"info": 1, "warning": 2, "critical": 3})

fig = px.scatter(
    events_df,
    x="timestamp",
    y="event_type",
    color="severity",
    size="severity_num",
    title="🕐 Attack Timeline Reconstruction",
    template=PLOTLY_TEMPLATE,
    color_discrete_map={
        "info": COLORS["info"],
        "warning": COLORS["warning"],
        "critical": COLORS["danger"],
    },
    hover_data=["source_ip", "user", "success", "response_ms"],
)

fig.update_layout(
    xaxis_title="Time",
    yaxis_title="Event Type",
    height=450,
    legend_title="Severity",
)

fig.show()

# Attack chain analysis
print("\n🔍 Attack Chain Analysis:")
critical_events = events_df[events_df["severity"] == "critical"].sort_values("timestamp")
for i, (_, event) in enumerate(critical_events.head(5).iterrows()):
    print(f"   Step {i+1}: {event['event_type']} from {event['source_ip']} ({event['timestamp'].strftime('%H:%M')})")

## Summary

### What You Learned

| Skill | Application |
|-------|-------------|
| Baseline Statistics | Establishing normal behavior |
| Z-Score Analysis | Anomaly detection and scoring |
| Distribution Plots | Understanding data spread |
| Time Series | Attack timeline analysis |
| Correlation Heatmaps | Feature relationships |
| Dashboards | SOC operations overview |

### Key Takeaways

1. **Always start with statistics** before visualization
2. **Use appropriate chart types** for your data and question
3. **Add interactivity** for exploration (hover, zoom, filter)
4. **Highlight anomalies** to draw attention to issues
5. **Combine views** in dashboards for comprehensive monitoring

### Next Steps

- **Lab 01**: Apply to phishing classification (confusion matrix, ROC)
- **Lab 02**: Visualize malware clustering (t-SNE, PCA)
- **Lab 03**: Build anomaly detection dashboards