# Lab 03: Next-Gen Firewall (NGFW) Traffic Anomaly Detection

Build a **Next-Gen Firewall (NGFW)** anomaly detection system using synthetic firewall traffic logs enriched with **Layer 7 deep packet inspection (DPI)** metadata.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab03_anomaly_detection.ipynb)

## Learning Objectives
- Parse and analyze **firewall traffic logs** (not NetFlow)
- Engineer **Layer 7 features** (HTTP, DNS, TLS metadata)
- Use firewall-specific features: actions, zones, rule IDs, threat categories
- Detect anomalies with Isolation Forest
- Apply One-Class SVM and Local Outlier Factor

## Firewall vs NetFlow

| Feature | NetFlow | Firewall Logs |
|---------|---------|---------------|
| Data source | Router/switch | Firewall appliance |
| Granularity | Aggregated flows | Per-session/packet |
| **Action visibility** | ❌ None | ✅ Allow/Deny/Drop |
| **L7 inspection** | ❌ Limited | ✅ Full DPI |
| **Threat detection** | ❌ None | ✅ IPS/AV verdicts |
| Zone info | ❌ None | ✅ Trust/Untrust/DMZ |

## NGFW Deep Packet Inspection (7 Protocols)

| Protocol | Inspection Features |
|----------|---------------------|
| **HTTP** | User-agents, methods, response codes, content-type |
| **DNS** | Query types, domain entropy, TXT sizes |
| **TLS** | JA3/JA4 fingerprints, cert validity, SNI |
| **QUIC/HTTP3** | Version, 0-RTT, connection migration |
| **gRPC** | Method type, status codes, message sizes |
| **WebSocket** | Frame frequency, message entropy, binary ratio |
| **RMM Tools** | Duration risk, zone anomaly, multi-target (AnyDesk, TeamViewer, etc.) |
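
The **domain entropy** feature in the DNS row can be made concrete with Shannon entropy over the characters of a name. The sketch below is one common way to compute it, not necessarily what a given NGFW does internally; the function name `shannon_entropy` and the example names are hypothetical, and real scores depend on name length and alphabet.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical example names (not taken from the lab's dataset)
for name in ["google", "mail-server", "x7f3kq9zt2vb8mw1"]:
    print(f"{name:>18}  entropy = {shannon_entropy(name):.2f} bits/char")
```

A string with repeated characters scores lower than a same-length string with no repeats, which is why algorithmically generated (DGA-like) names tend to stand out. A 16-character name with no repeated characters scores exactly log2(16) = 4 bits per character.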

### Why RMM Detection Matters
Threat actors (LAPSUS$, Conti, BlackCat) abuse legitimate RMM tools for:
- **Initial Access** - Phishing → RMM install
- **Persistence** - Survives reboots, looks legitimate
- **C2** - Encrypted, blends with IT traffic
- **Lateral Movement** - Connect to multiple internal hosts
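
The three RMM indicators from the table (duration risk, zone anomaly, multi-target) can be sketched as simple per-session flags. This is a minimal illustration under stated assumptions: the 2-hour threshold, the "admins live in the `trust` zone" rule, the `Session` class, and the `rmm_risk_flags` helper are all hypothetical choices for this example, while the field names mirror the columns used in this lab's generated logs.

```python
from dataclasses import dataclass

# Illustrative assumptions, not values prescribed by any vendor:
RMM_APPS = {"anydesk", "teamviewer", "connectwise", "splashtop"}
MAX_EXPECTED_DURATION_S = 2 * 60 * 60  # longer sessions count as "duration risk"
EXPECTED_SRC_ZONE = "trust"            # IT admins are assumed to sit in the trust zone


@dataclass
class Session:
    app_id: str
    duration: float  # seconds
    src_zone: str
    dst_ip_count: int


def rmm_risk_flags(s: Session) -> dict:
    """Return the three RMM indicators as 0/1 flags plus their sum."""
    if s.app_id not in RMM_APPS:
        return {"duration_risk": 0, "zone_anomaly": 0, "multi_target": 0, "score": 0}
    flags = {
        "duration_risk": int(s.duration > MAX_EXPECTED_DURATION_S),
        "zone_anomaly": int(s.src_zone != EXPECTED_SRC_ZONE),
        "multi_target": int(s.dst_ip_count > 1),
    }
    flags["score"] = sum(flags.values())  # summed before "score" is added
    return flags


admin = Session("teamviewer", duration=1800, src_zone="trust", dst_ip_count=1)
intruder = Session("anydesk", duration=6 * 3600, src_zone="guest", dst_ip_count=4)
print(rmm_risk_flags(admin))
print(rmm_risk_flags(intruder))
```

Hand-written flags like these are easy to explain to an analyst, but fixed thresholds are brittle, which is part of the motivation for the unsupervised detectors used in this lab.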

In [None]:
# Install dependencies (uncomment for Colab)
# !pip install scikit-learn pandas numpy matplotlib seaborn plotly

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Plotly for interactive visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

plt.style.use("seaborn-v0_8-whitegrid")
np.random.seed(42)

# Plotly template for Colab
PLOTLY_TEMPLATE = "plotly_white"

## 1. Generate Firewall Traffic Logs with L7 Metadata

In [None]:
# Generate NGFW traffic logs with diverse attack patterns
n_normal = 2000  # Nominal target only; the normal profiles below sum to 2,330 sessions

# Firewall zones
ZONES = ["trust", "untrust", "dmz", "guest"]

# ============================================================
# NORMAL TRAFFIC - Multiple enterprise traffic profiles
# ============================================================

# Web browsing (HTTP/HTTPS)
n_web = 600
web_traffic = {
    "bytes_sent": np.random.lognormal(7, 0.8, n_web),  # Requests
    "bytes_recv": np.random.lognormal(10, 1.5, n_web),  # Responses (pages, images)
    "packets_sent": np.random.poisson(30, n_web),
    "packets_recv": np.random.poisson(100, n_web),
    "duration": np.random.exponential(3, n_web),
    "dst_port": np.random.choice([80, 443], n_web, p=[0.2, 0.8]),
    "protocol": np.full(n_web, "TCP"),
    "src_ip_count": np.ones(n_web, dtype=int),
    "dst_ip_count": np.ones(n_web, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_web, "allow"),
    "src_zone": np.random.choice(["trust", "guest"], n_web, p=[0.8, 0.2]),
    "dst_zone": np.full(n_web, "untrust"),
    "rule_id": np.random.choice([101, 102, 103], n_web),  # Web access rules
    "app_id": np.random.choice(["web-browsing", "ssl", "google-base"], n_web),
    "threat_category": np.full(n_web, "none"),
    "url_category": np.random.choice(["business", "news", "shopping", "technology"], n_web),
    "attack_type": "normal",
    "label": 0,
}

# Email traffic (SMTP/IMAP/POP3)
n_email = 200
email_traffic = {
    "bytes_sent": np.random.lognormal(8, 1.0, n_email),
    "bytes_recv": np.random.lognormal(9, 1.2, n_email),
    "packets_sent": np.random.poisson(40, n_email),
    "packets_recv": np.random.poisson(60, n_email),
    "duration": np.random.exponential(2, n_email),
    "dst_port": np.random.choice([25, 587, 993, 995, 143], n_email),
    "protocol": np.full(n_email, "TCP"),
    "src_ip_count": np.ones(n_email, dtype=int),
    "dst_ip_count": np.ones(n_email, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_email, "allow"),
    "src_zone": np.full(n_email, "trust"),
    "dst_zone": np.full(n_email, "untrust"),
    "rule_id": np.full(n_email, 201),  # Email rule
    "app_id": np.random.choice(["smtp", "imap", "pop3"], n_email),
    "threat_category": np.full(n_email, "none"),
    "url_category": np.full(n_email, "none"),
    "attack_type": "normal",
    "label": 0,
}

# DNS queries (normal)
n_dns = 400
dns_traffic = {
    "bytes_sent": np.random.normal(70, 15, n_dns).clip(40, 200),
    "bytes_recv": np.random.normal(150, 40, n_dns).clip(80, 400),
    "packets_sent": np.ones(n_dns, dtype=int),  # Single query
    "packets_recv": np.random.choice([1, 2, 3], n_dns, p=[0.7, 0.2, 0.1]),
    "duration": np.random.uniform(0.001, 0.2, n_dns),  # Fast
    "dst_port": np.full(n_dns, 53),
    "protocol": np.full(n_dns, "UDP"),
    "src_ip_count": np.ones(n_dns, dtype=int),
    "dst_ip_count": np.ones(n_dns, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_dns, "allow"),
    "src_zone": np.random.choice(["trust", "dmz", "guest"], n_dns, p=[0.6, 0.2, 0.2]),
    "dst_zone": np.full(n_dns, "untrust"),
    "rule_id": np.full(n_dns, 301),  # DNS rule
    "app_id": np.full(n_dns, "dns"),
    "threat_category": np.full(n_dns, "none"),
    "url_category": np.full(n_dns, "none"),
    "attack_type": "normal",
    "label": 0,
}

# SSH sessions
n_ssh = 100
ssh_traffic = {
    "bytes_sent": np.random.lognormal(9, 1.5, n_ssh),
    "bytes_recv": np.random.lognormal(10, 1.8, n_ssh),
    "packets_sent": np.random.poisson(200, n_ssh),
    "packets_recv": np.random.poisson(250, n_ssh),
    "duration": np.random.exponential(300, n_ssh),  # Long sessions
    "dst_port": np.full(n_ssh, 22),
    "protocol": np.full(n_ssh, "TCP"),
    "src_ip_count": np.ones(n_ssh, dtype=int),
    "dst_ip_count": np.ones(n_ssh, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_ssh, "allow"),
    "src_zone": np.full(n_ssh, "trust"),
    "dst_zone": np.random.choice(["dmz", "untrust"], n_ssh, p=[0.7, 0.3]),
    "rule_id": np.full(n_ssh, 401),  # Admin SSH rule
    "app_id": np.full(n_ssh, "ssh"),
    "threat_category": np.full(n_ssh, "none"),
    "url_category": np.full(n_ssh, "none"),
    "attack_type": "normal",
    "label": 0,
}

# Database connections
n_db = 200
db_traffic = {
    "bytes_sent": np.random.lognormal(7, 1.2, n_db),
    "bytes_recv": np.random.lognormal(11, 1.5, n_db),
    "packets_sent": np.random.poisson(50, n_db),
    "packets_recv": np.random.poisson(150, n_db),
    "duration": np.random.exponential(1, n_db),
    "dst_port": np.random.choice([3306, 5432, 1433, 27017], n_db),
    "protocol": np.full(n_db, "TCP"),
    "src_ip_count": np.ones(n_db, dtype=int),
    "dst_ip_count": np.ones(n_db, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_db, "allow"),
    "src_zone": np.full(n_db, "trust"),
    "dst_zone": np.full(n_db, "dmz"),
    "rule_id": np.full(n_db, 501),  # Database access rule
    "app_id": np.random.choice(["mysql", "postgresql", "mssql", "mongodb"], n_db),
    "threat_category": np.full(n_db, "none"),
    "url_category": np.full(n_db, "none"),
    "attack_type": "normal",
    "label": 0,
}

# API traffic
n_api = 400
api_traffic = {
    "bytes_sent": np.random.lognormal(7, 0.8, n_api),
    "bytes_recv": np.random.lognormal(8, 1.0, n_api),
    "packets_sent": np.random.poisson(10, n_api),
    "packets_recv": np.random.poisson(15, n_api),
    "duration": np.random.exponential(0.5, n_api),
    "dst_port": np.random.choice([443, 8443, 8080], n_api),
    "protocol": np.full(n_api, "TCP"),
    "src_ip_count": np.ones(n_api, dtype=int),
    "dst_ip_count": np.ones(n_api, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_api, "allow"),
    "src_zone": np.random.choice(["trust", "dmz"], n_api, p=[0.6, 0.4]),
    "dst_zone": np.full(n_api, "untrust"),
    "rule_id": np.random.choice([102, 601], n_api),  # API rules
    "app_id": np.random.choice(["ssl", "http2", "rest-api"], n_api),
    "threat_category": np.full(n_api, "none"),
    "url_category": np.random.choice(["cloud-services", "saas", "business"], n_api),
    "attack_type": "normal",
    "label": 0,
}

# QUIC/HTTP3 traffic (UDP-based, Google services, CDNs)
n_quic = 150
quic_traffic = {
    "bytes_sent": np.random.lognormal(7, 0.9, n_quic),
    "bytes_recv": np.random.lognormal(10, 1.3, n_quic),
    "packets_sent": np.random.poisson(25, n_quic),
    "packets_recv": np.random.poisson(80, n_quic),
    "duration": np.random.exponential(2, n_quic),
    "dst_port": np.full(n_quic, 443),  # QUIC uses 443/UDP
    "protocol": np.full(n_quic, "UDP"),  # Key difference: UDP not TCP
    "src_ip_count": np.ones(n_quic, dtype=int),
    "dst_ip_count": np.ones(n_quic, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_quic, "allow"),
    "src_zone": np.random.choice(["trust", "guest"], n_quic, p=[0.7, 0.3]),
    "dst_zone": np.full(n_quic, "untrust"),
    "rule_id": np.full(n_quic, 104),  # QUIC rule
    "app_id": np.random.choice(["quic", "http3", "google-quic"], n_quic),
    "threat_category": np.full(n_quic, "none"),
    "url_category": np.random.choice(["search-engines", "cloud-services", "streaming"], n_quic),
    "attack_type": "normal",
    "label": 0,
}

# gRPC traffic (microservices, internal APIs)
n_grpc = 100
grpc_traffic = {
    "bytes_sent": np.random.lognormal(6, 0.7, n_grpc),  # Protobuf = smaller payloads
    "bytes_recv": np.random.lognormal(7, 0.9, n_grpc),
    "packets_sent": np.random.poisson(15, n_grpc),
    "packets_recv": np.random.poisson(20, n_grpc),
    "duration": np.random.exponential(0.3, n_grpc),  # Fast RPC calls
    "dst_port": np.random.choice([50051, 443, 8080], n_grpc, p=[0.5, 0.3, 0.2]),
    "protocol": np.full(n_grpc, "TCP"),
    "src_ip_count": np.ones(n_grpc, dtype=int),
    "dst_ip_count": np.ones(n_grpc, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_grpc, "allow"),
    "src_zone": np.random.choice(["trust", "dmz"], n_grpc, p=[0.5, 0.5]),
    "dst_zone": np.random.choice(["dmz", "trust"], n_grpc, p=[0.6, 0.4]),
    "rule_id": np.full(n_grpc, 602),  # gRPC/microservices rule
    "app_id": np.full(n_grpc, "grpc"),
    "threat_category": np.full(n_grpc, "none"),
    "url_category": np.full(n_grpc, "none"),
    "attack_type": "normal",
    "label": 0,
}

# WebSocket traffic (real-time apps, chat, trading)
n_ws = 100
websocket_traffic = {
    "bytes_sent": np.random.lognormal(7, 1.2, n_ws),
    "bytes_recv": np.random.lognormal(8, 1.5, n_ws),
    "packets_sent": np.random.poisson(100, n_ws),  # Many small frames
    "packets_recv": np.random.poisson(150, n_ws),
    "duration": np.random.uniform(60, 3600, n_ws),  # Long-lived connections
    "dst_port": np.random.choice([80, 443, 8080], n_ws, p=[0.1, 0.7, 0.2]),
    "protocol": np.full(n_ws, "TCP"),
    "src_ip_count": np.ones(n_ws, dtype=int),
    "dst_ip_count": np.ones(n_ws, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_ws, "allow"),
    "src_zone": np.random.choice(["trust", "guest"], n_ws, p=[0.8, 0.2]),
    "dst_zone": np.full(n_ws, "untrust"),
    "rule_id": np.full(n_ws, 105),  # WebSocket rule
    "app_id": np.full(n_ws, "websocket"),
    "threat_category": np.full(n_ws, "none"),
    "url_category": np.random.choice(["business", "financial", "collaboration"], n_ws),
    "attack_type": "normal",
    "label": 0,
}

# RMM (Remote Monitoring & Management) - Legitimate IT admin use
# Tools: AnyDesk, TeamViewer, ConnectWise, Splashtop, LogMeIn
n_rmm = 80
rmm_traffic = {
    "bytes_sent": np.random.lognormal(9, 1.0, n_rmm),  # Screen sharing = moderate data
    "bytes_recv": np.random.lognormal(10, 1.2, n_rmm),
    "packets_sent": np.random.poisson(150, n_rmm),
    "packets_recv": np.random.poisson(200, n_rmm),
    "duration": np.random.uniform(300, 7200, n_rmm),  # 5min - 2hr sessions
    "dst_port": np.random.choice([443, 7070, 5938, 6568], n_rmm, p=[0.4, 0.2, 0.2, 0.2]),
    "protocol": np.full(n_rmm, "TCP"),
    "src_ip_count": np.ones(n_rmm, dtype=int),
    "dst_ip_count": np.ones(n_rmm, dtype=int),
    # Firewall-specific fields
    "action": np.full(n_rmm, "allow"),
    "src_zone": np.full(n_rmm, "trust"),  # IT admins in trusted zone
    "dst_zone": np.random.choice(["trust", "dmz"], n_rmm, p=[0.7, 0.3]),  # Managing internal hosts
    "rule_id": np.full(n_rmm, 801),  # RMM whitelist rule
    "app_id": np.random.choice(["anydesk", "teamviewer", "connectwise", "splashtop"], n_rmm),
    "threat_category": np.full(n_rmm, "none"),
    "url_category": np.full(n_rmm, "remote-access"),
    "attack_type": "normal",
    "label": 0,
}

# ============================================================
# ATTACK TRAFFIC - Multiple attack categories with MITRE mapping
# ============================================================

# Attack Type 1: PORT SCANNING (T1046 - Network Service Discovery)
n_scan = 50
port_scan = {
    "bytes_sent": np.random.normal(60, 10, n_scan),  # Small SYN packets
    "bytes_recv": np.random.choice([0, 40], n_scan, p=[0.7, 0.3]),  # Mostly no response
    "packets_sent": np.random.randint(100, 1000, n_scan),  # Many probes
    "packets_recv": np.random.randint(0, 100, n_scan),
    "duration": np.random.uniform(1, 30, n_scan),
    "dst_port": np.random.randint(1, 65535, n_scan),  # Random ports
    "protocol": np.full(n_scan, "TCP"),
    "src_ip_count": np.ones(n_scan, dtype=int),
    "dst_ip_count": np.random.randint(50, 500, n_scan),  # Many destinations
    # Firewall-specific fields
    "action": np.random.choice(["deny", "drop", "alert"], n_scan, p=[0.4, 0.4, 0.2]),
    "src_zone": np.random.choice(["untrust", "guest"], n_scan, p=[0.8, 0.2]),
    "dst_zone": np.random.choice(["trust", "dmz"], n_scan, p=[0.6, 0.4]),
    "rule_id": np.full(n_scan, 999),  # Implicit deny rule
    "app_id": np.full(n_scan, "incomplete"),
    "threat_category": np.full(n_scan, "scan"),
    "url_category": np.full(n_scan, "none"),
    "attack_type": "port_scan",
    "label": 1,
}

# Attack Type 2: BRUTE FORCE against SSH/RDP/FTP/Telnet (T1110 - Brute Force)
n_brute = 40
brute_force = {
    "bytes_sent": np.random.normal(500, 100, n_brute),  # Login attempts
    "bytes_recv": np.random.normal(200, 50, n_brute),
    "packets_sent": np.random.randint(50, 200, n_brute),  # Repeated attempts
    "packets_recv": np.random.randint(50, 200, n_brute),
    "duration": np.random.uniform(60, 600, n_brute),  # Long duration
    "dst_port": np.random.choice([22, 3389, 21, 23], n_brute),  # Auth services
    "protocol": np.full(n_brute, "TCP"),
    "src_ip_count": np.ones(n_brute, dtype=int),
    "dst_ip_count": np.ones(n_brute, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["deny", "alert"], n_brute, p=[0.6, 0.4]),
    "src_zone": np.full(n_brute, "untrust"),
    "dst_zone": np.random.choice(["trust", "dmz"], n_brute),
    "rule_id": np.random.choice([401, 999], n_brute),  # SSH rule or deny
    "app_id": np.random.choice(["ssh", "rdp", "ftp"], n_brute),
    "threat_category": np.full(n_brute, "brute-force"),
    "url_category": np.full(n_brute, "none"),
    "attack_type": "brute_force",
    "label": 1,
}

# Attack Type 3: C2 BEACONING (T1071 - Application Layer Protocol)
n_c2 = 50
c2_beacon = {
    "bytes_sent": np.random.normal(256, 50, n_c2),  # Regular beacon size
    "bytes_recv": np.random.normal(512, 100, n_c2),  # Command responses
    "packets_sent": np.random.poisson(5, n_c2),
    "packets_recv": np.random.poisson(8, n_c2),
    "duration": np.random.uniform(0.1, 2, n_c2),  # Short transactions
    "dst_port": np.random.choice([443, 80, 8080, 8443], n_c2),  # Blend with web
    "protocol": np.full(n_c2, "TCP"),
    "src_ip_count": np.ones(n_c2, dtype=int),
    "dst_ip_count": np.ones(n_c2, dtype=int),
    # Firewall-specific fields - C2 often evades detection initially
    "action": np.random.choice(["allow", "alert"], n_c2, p=[0.7, 0.3]),
    "src_zone": np.full(n_c2, "trust"),  # Compromised internal host
    "dst_zone": np.full(n_c2, "untrust"),
    "rule_id": np.random.choice([101, 102], n_c2),  # Allowed web traffic
    "app_id": np.random.choice(["ssl", "web-browsing", "unknown-tcp"], n_c2),
    "threat_category": np.random.choice(["none", "command-and-control"], n_c2, p=[0.6, 0.4]),
    "url_category": np.random.choice(["unknown", "newly-registered", "dynamic-dns"], n_c2),
    "attack_type": "c2_beacon",
    "label": 1,
}

# Attack Type 4: DATA EXFILTRATION (T1048 - Exfiltration Over Alternative Protocol)
n_exfil = 30
data_exfil = {
    "bytes_sent": np.random.lognormal(14, 1, n_exfil),  # Large uploads (median ~1.2 MB, long right tail)
    "bytes_recv": np.random.normal(500, 100, n_exfil),  # Small ACKs
    "packets_sent": np.random.randint(1000, 10000, n_exfil),
    "packets_recv": np.random.randint(100, 500, n_exfil),
    "duration": np.random.uniform(60, 3600, n_exfil),  # Long transfers
    "dst_port": np.random.choice([443, 53, 21, 22], n_exfil),
    "protocol": np.full(n_exfil, "TCP"),
    "src_ip_count": np.ones(n_exfil, dtype=int),
    "dst_ip_count": np.ones(n_exfil, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["allow", "alert"], n_exfil, p=[0.5, 0.5]),
    "src_zone": np.full(n_exfil, "trust"),
    "dst_zone": np.full(n_exfil, "untrust"),
    "rule_id": np.random.choice([102, 301], n_exfil),
    "app_id": np.random.choice(["ssl", "ftp", "ssh", "dns"], n_exfil),
    "threat_category": np.random.choice(["none", "data-theft"], n_exfil, p=[0.6, 0.4]),
    "url_category": np.random.choice(["file-sharing", "cloud-storage", "unknown"], n_exfil),
    "attack_type": "data_exfil",
    "label": 1,
}

# Attack Type 5: DNS TUNNELING (T1071.004 - DNS)
n_dns_tunnel = 40
dns_tunnel = {
    "bytes_sent": np.random.randint(200, 500, n_dns_tunnel),  # Large DNS queries
    "bytes_recv": np.random.randint(300, 800, n_dns_tunnel),  # Large TXT responses
    "packets_sent": np.random.randint(50, 200, n_dns_tunnel),  # Many queries
    "packets_recv": np.random.randint(50, 200, n_dns_tunnel),
    "duration": np.random.uniform(60, 600, n_dns_tunnel),
    "dst_port": np.full(n_dns_tunnel, 53),
    "protocol": np.full(n_dns_tunnel, "UDP"),
    "src_ip_count": np.ones(n_dns_tunnel, dtype=int),
    "dst_ip_count": np.ones(n_dns_tunnel, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["allow", "alert"], n_dns_tunnel, p=[0.4, 0.6]),
    "src_zone": np.full(n_dns_tunnel, "trust"),
    "dst_zone": np.full(n_dns_tunnel, "untrust"),
    "rule_id": np.full(n_dns_tunnel, 301),  # DNS allowed but flagged
    "app_id": np.full(n_dns_tunnel, "dns"),
    "threat_category": np.full(n_dns_tunnel, "dns-tunneling"),
    "url_category": np.full(n_dns_tunnel, "none"),
    "attack_type": "dns_tunnel",
    "label": 1,
}

# Attack Type 6: DDoS VOLUMETRIC (T1498 - Network Denial of Service)
n_ddos = 30
ddos_attack = {
    "bytes_sent": np.random.lognormal(13, 0.5, n_ddos),  # High volume
    "bytes_recv": np.random.normal(0, 10, n_ddos).clip(0),  # Little response
    "packets_sent": np.random.randint(10000, 100000, n_ddos),  # Massive packets
    "packets_recv": np.random.randint(0, 100, n_ddos),
    "duration": np.random.uniform(30, 300, n_ddos),
    "dst_port": np.random.choice([80, 443, 53], n_ddos),
    "protocol": np.random.choice(["TCP", "UDP"], n_ddos),
    "src_ip_count": np.random.randint(100, 1000, n_ddos),  # Spoofed sources
    "dst_ip_count": np.ones(n_ddos, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["drop", "deny"], n_ddos, p=[0.7, 0.3]),
    "src_zone": np.full(n_ddos, "untrust"),
    "dst_zone": np.random.choice(["dmz", "trust"], n_ddos),
    "rule_id": np.full(n_ddos, 999),  # Rate limiting or DoS protection
    "app_id": np.random.choice(["incomplete", "unknown-udp", "unknown-tcp"], n_ddos),
    "threat_category": np.full(n_ddos, "flood"),
    "url_category": np.full(n_ddos, "none"),
    "attack_type": "ddos",
    "label": 1,
}

# Attack Type 7: LATERAL MOVEMENT SMB (T1021.002 - SMB/Windows Admin Shares)
n_lateral = 40
lateral_movement = {
    "bytes_sent": np.random.lognormal(10, 1.2, n_lateral),
    "bytes_recv": np.random.lognormal(11, 1.5, n_lateral),
    "packets_sent": np.random.poisson(200, n_lateral),
    "packets_recv": np.random.poisson(250, n_lateral),
    "duration": np.random.exponential(10, n_lateral),
    "dst_port": np.random.choice([445, 135, 139, 5985], n_lateral),  # SMB/RPC/WinRM
    "protocol": np.full(n_lateral, "TCP"),
    "src_ip_count": np.ones(n_lateral, dtype=int),
    "dst_ip_count": np.random.randint(2, 20, n_lateral),  # Multiple internal hosts
    # Firewall-specific fields - internal traffic often allowed
    "action": np.random.choice(["allow", "alert"], n_lateral, p=[0.6, 0.4]),
    "src_zone": np.full(n_lateral, "trust"),
    "dst_zone": np.full(n_lateral, "trust"),  # Internal lateral movement
    "rule_id": np.random.choice([701, 702], n_lateral),  # Internal rules
    "app_id": np.random.choice(["ms-ds-smb", "msrpc", "winrm"], n_lateral),
    "threat_category": np.random.choice(["none", "lateral-movement"], n_lateral, p=[0.5, 0.5]),
    "url_category": np.full(n_lateral, "none"),
    "attack_type": "lateral_movement",
    "label": 1,
}

# Attack Type 8: CRYPTO MINING (T1496 - Resource Hijacking)
n_mining = 30
crypto_mining = {
    "bytes_sent": np.random.normal(1000, 200, n_mining),  # Share submissions
    "bytes_recv": np.random.normal(500, 100, n_mining),  # Work units
    "packets_sent": np.random.poisson(100, n_mining),
    "packets_recv": np.random.poisson(80, n_mining),
    "duration": np.random.uniform(3600, 86400, n_mining),  # Very long (hours)
    "dst_port": np.random.choice([3333, 4444, 8333, 14444], n_mining),  # Mining pools
    "protocol": np.full(n_mining, "TCP"),
    "src_ip_count": np.ones(n_mining, dtype=int),
    "dst_ip_count": np.ones(n_mining, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["deny", "alert"], n_mining, p=[0.5, 0.5]),
    "src_zone": np.full(n_mining, "trust"),
    "dst_zone": np.full(n_mining, "untrust"),
    "rule_id": np.full(n_mining, 999),  # Often blocked by category
    "app_id": np.random.choice(["stratum", "bitcoin", "unknown-tcp"], n_mining),
    "threat_category": np.full(n_mining, "cryptocurrency"),
    "url_category": np.full(n_mining, "cryptocurrency"),
    "attack_type": "crypto_mining",
    "label": 1,
}

# Attack Type 9: QUIC C2 TUNNEL (T1572 - Protocol Tunneling)
# Attackers abuse QUIC's encryption to hide C2 traffic
n_quic_c2 = 25
quic_c2 = {
    "bytes_sent": np.random.normal(500, 150, n_quic_c2),
    "bytes_recv": np.random.normal(800, 200, n_quic_c2),
    "packets_sent": np.random.poisson(40, n_quic_c2),
    "packets_recv": np.random.poisson(60, n_quic_c2),
    "duration": np.random.uniform(30, 300, n_quic_c2),  # Regular check-ins
    "dst_port": np.full(n_quic_c2, 443),
    "protocol": np.full(n_quic_c2, "UDP"),
    "src_ip_count": np.ones(n_quic_c2, dtype=int),
    "dst_ip_count": np.ones(n_quic_c2, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["allow", "alert"], n_quic_c2, p=[0.6, 0.4]),
    "src_zone": np.full(n_quic_c2, "trust"),
    "dst_zone": np.full(n_quic_c2, "untrust"),
    "rule_id": np.full(n_quic_c2, 104),  # Allowed as QUIC
    "app_id": np.random.choice(["quic", "unknown-udp"], n_quic_c2, p=[0.7, 0.3]),
    "threat_category": np.random.choice(["none", "command-and-control"], n_quic_c2, p=[0.5, 0.5]),
    "url_category": np.random.choice(["unknown", "dynamic-dns", "newly-registered"], n_quic_c2),
    "attack_type": "quic_c2",
    "label": 1,
}

# Attack Type 10: gRPC API ABUSE (T1190 - Exploit Public-Facing Application)
# Malicious requests to internal microservices
n_grpc_abuse = 25
grpc_abuse = {
    "bytes_sent": np.random.lognormal(8, 1.2, n_grpc_abuse),  # Large malicious payloads
    "bytes_recv": np.random.lognormal(9, 1.5, n_grpc_abuse),  # Data extraction
    "packets_sent": np.random.poisson(50, n_grpc_abuse),
    "packets_recv": np.random.poisson(100, n_grpc_abuse),
    "duration": np.random.exponential(5, n_grpc_abuse),
    "dst_port": np.random.choice([50051, 443], n_grpc_abuse, p=[0.6, 0.4]),
    "protocol": np.full(n_grpc_abuse, "TCP"),
    "src_ip_count": np.ones(n_grpc_abuse, dtype=int),
    "dst_ip_count": np.random.randint(3, 15, n_grpc_abuse),  # Hitting multiple services
    # Firewall-specific fields
    "action": np.random.choice(["allow", "alert"], n_grpc_abuse, p=[0.5, 0.5]),
    "src_zone": np.random.choice(["untrust", "dmz"], n_grpc_abuse, p=[0.6, 0.4]),
    "dst_zone": np.random.choice(["dmz", "trust"], n_grpc_abuse, p=[0.7, 0.3]),
    "rule_id": np.random.choice([602, 999], n_grpc_abuse),
    "app_id": np.random.choice(["grpc", "http2", "unknown-tcp"], n_grpc_abuse),
    "threat_category": np.random.choice(["none", "injection", "api-abuse"], n_grpc_abuse, p=[0.3, 0.35, 0.35]),
    "url_category": np.full(n_grpc_abuse, "none"),
    "attack_type": "grpc_abuse",
    "label": 1,
}

# Attack Type 11: WEBSOCKET C2 (T1071.001 - Web Protocols)
# Persistent WebSocket for C2 - hard to detect as looks like normal real-time app
n_ws_c2 = 25
websocket_c2 = {
    "bytes_sent": np.random.normal(300, 100, n_ws_c2),  # Commands
    "bytes_recv": np.random.normal(1000, 300, n_ws_c2),  # Exfil data
    "packets_sent": np.random.poisson(200, n_ws_c2),  # Many heartbeats
    "packets_recv": np.random.poisson(250, n_ws_c2),
    "duration": np.random.uniform(3600, 43200, n_ws_c2),  # Very long (hours)
    "dst_port": np.random.choice([80, 443], n_ws_c2, p=[0.2, 0.8]),
    "protocol": np.full(n_ws_c2, "TCP"),
    "src_ip_count": np.ones(n_ws_c2, dtype=int),
    "dst_ip_count": np.ones(n_ws_c2, dtype=int),
    # Firewall-specific fields
    "action": np.random.choice(["allow", "alert"], n_ws_c2, p=[0.7, 0.3]),
    "src_zone": np.full(n_ws_c2, "trust"),
    "dst_zone": np.full(n_ws_c2, "untrust"),
    "rule_id": np.full(n_ws_c2, 105),
    "app_id": np.random.choice(["websocket", "ssl", "unknown-tcp"], n_ws_c2),
    "threat_category": np.random.choice(["none", "command-and-control"], n_ws_c2, p=[0.4, 0.6]),
    "url_category": np.random.choice(["unknown", "dynamic-dns", "suspicious"], n_ws_c2),
    "attack_type": "websocket_c2",
    "label": 1,
}

# Attack Type 12: RMM ABUSE (T1219 - Remote Access Software)
# Threat actors abuse legitimate RMM tools for persistence and C2
# Real examples: LAPSUS$, Conti, BlackCat all abused AnyDesk/TeamViewer
n_rmm_abuse = 35
rmm_abuse = {
    "bytes_sent": np.random.lognormal(11, 1.5, n_rmm_abuse),  # Large file transfers
    "bytes_recv": np.random.lognormal(10, 1.3, n_rmm_abuse),
    "packets_sent": np.random.poisson(300, n_rmm_abuse),
    "packets_recv": np.random.poisson(350, n_rmm_abuse),
    "duration": np.random.uniform(1800, 28800, n_rmm_abuse),  # 30min - 8hr (persistence)
    "dst_port": np.random.choice([443, 7070, 5938, 6568, 4443], n_rmm_abuse),
    "protocol": np.full(n_rmm_abuse, "TCP"),
    "src_ip_count": np.ones(n_rmm_abuse, dtype=int),
    "dst_ip_count": np.random.randint(1, 5, n_rmm_abuse),  # May connect to multiple targets
    # Firewall-specific fields - often allowed as "legitimate" tool
    "action": np.random.choice(["allow", "alert"], n_rmm_abuse, p=[0.6, 0.4]),
    "src_zone": np.random.choice(["trust", "guest", "untrust"], n_rmm_abuse, p=[0.4, 0.3, 0.3]),
    "dst_zone": np.random.choice(["trust", "untrust"], n_rmm_abuse, p=[0.5, 0.5]),
    "rule_id": np.random.choice([801, 999], n_rmm_abuse, p=[0.4, 0.6]),  # May hit RMM rule or default
    "app_id": np.random.choice(["anydesk", "teamviewer", "screenconnect", "atera", "splashtop"], n_rmm_abuse),
    "threat_category": np.random.choice(["none", "remote-access-trojan", "command-and-control"], n_rmm_abuse, p=[0.3, 0.4, 0.3]),
    "url_category": np.random.choice(["remote-access", "unknown", "newly-registered"], n_rmm_abuse),
    "attack_type": "rmm_abuse",
    "label": 1,
}

# ============================================================
# Combine all traffic
# ============================================================
all_traffic = [
    # Normal
    web_traffic,
    email_traffic,
    dns_traffic,
    ssh_traffic,
    db_traffic,
    api_traffic,
    quic_traffic,      # HTTP/3 QUIC
    grpc_traffic,      # gRPC microservices
    websocket_traffic, # WebSocket real-time
    rmm_traffic,       # RMM tools (AnyDesk, TeamViewer, etc.)
    # Attacks
    port_scan,
    brute_force,
    c2_beacon,
    data_exfil,
    dns_tunnel,
    ddos_attack,
    lateral_movement,
    crypto_mining,
    quic_c2,           # QUIC tunneling attack
    grpc_abuse,        # gRPC API abuse
    websocket_c2,      # WebSocket C2
    rmm_abuse,         # RMM tool abuse (T1219)
]

df = pd.concat([pd.DataFrame(t) for t in all_traffic], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("🔥 NGFW Firewall Traffic Log Statistics:")
print(f"   Total sessions: {len(df)}")
print(f"   Normal traffic: {len(df[df['label'] == 0])}")
print(f"   Attack traffic: {len(df[df['label'] == 1])}")
print(f"   Attack percentage: {100 * df['label'].mean():.1f}%")

print("\n🛡️ Firewall Actions:")
print(df["action"].value_counts().to_string())

print("\n🌐 Zone Distribution:")
print(f"   Source zones: {df['src_zone'].value_counts().to_dict()}")
print(f"   Dest zones: {df['dst_zone'].value_counts().to_dict()}")

print("\n⚠️ Threat Categories Detected:")
threat_counts = df[df["threat_category"] != "none"]["threat_category"].value_counts()
for threat, count in threat_counts.items():
    print(f"   {threat}: {count}")

print("\n📱 Top Application IDs:")
print(df["app_id"].value_counts().head(8).to_string())
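
The class balance printed by the cell above follows directly from the profile sizes, and it can be checked by hand. The sketch below repeats that arithmetic using the `n_*` values from the generation cell; the dictionary names are hypothetical. The resulting attack fraction is one possible starting point for the `contamination` parameter of scikit-learn's `IsolationForest`, though in practice the true anomaly rate is rarely known in advance.

```python
# Session counts copied from the n_* variables in the generation cell
normal_sizes = {
    "web": 600, "email": 200, "dns": 400, "ssh": 100, "db": 200,
    "api": 400, "quic": 150, "grpc": 100, "websocket": 100, "rmm": 80,
}
attack_sizes = {
    "port_scan": 50, "brute_force": 40, "c2_beacon": 50, "data_exfil": 30,
    "dns_tunnel": 40, "ddos": 30, "lateral_movement": 40, "crypto_mining": 30,
    "quic_c2": 25, "grpc_abuse": 25, "websocket_c2": 25, "rmm_abuse": 35,
}

n_normal_total = sum(normal_sizes.values())
n_attack_total = sum(attack_sizes.values())
n_total = n_normal_total + n_attack_total
attack_fraction = n_attack_total / n_total

print(f"Normal sessions: {n_normal_total}")
print(f"Attack sessions: {n_attack_total}")
print(f"Total sessions:  {n_total}")
print(f"Attack fraction: {attack_fraction:.3f}")
```

This gives 2,330 normal and 420 attack sessions (2,750 total), so roughly 15.3% of the log is attack traffic.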

In [None]:
# ============================================================
# LAYER 7 (APPLICATION LAYER) FEATURE GENERATION
# Deep Packet Inspection (DPI) style metadata
# ============================================================

def calculate_domain_entropy(length):
    """Simulate domain-name entropy from name length (higher = more random/DGA-like)."""
    # Normal domains: 2.5-3.5, DGA domains: 3.8-4.5
    return np.random.uniform(2.5, 3.5) if length < 20 else np.random.uniform(3.8, 4.5)

# --- HTTP Metadata (for web traffic) ---
http_mask = df["dst_port"].isin([80, 443, 8080, 8443])

# User-Agent scores (0=missing, 1=suspicious, 2=normal browser, 3=known good)
df["http_ua_score"] = 0
df.loc[http_mask & (df["label"] == 0), "http_ua_score"] = np.random.choice([2, 3], (http_mask & (df["label"] == 0)).sum(), p=[0.7, 0.3])
df.loc[http_mask & (df["label"] == 1), "http_ua_score"] = np.random.choice([0, 1, 2], (http_mask & (df["label"] == 1)).sum(), p=[0.4, 0.4, 0.2])

# HTTP methods (encoded: GET=1, POST=2, PUT=3, DELETE=4, OPTIONS=5, unusual=6)
df["http_method"] = 0
df.loc[http_mask & (df["label"] == 0), "http_method"] = np.random.choice([1, 2], (http_mask & (df["label"] == 0)).sum(), p=[0.7, 0.3])
df.loc[http_mask & (df["label"] == 1), "http_method"] = np.random.choice([1, 2, 6], (http_mask & (df["label"] == 1)).sum(), p=[0.3, 0.4, 0.3])

# Response code category (2xx=2, 3xx=3, 4xx=4, 5xx=5)
df["http_resp_code"] = 0
df.loc[http_mask & (df["label"] == 0), "http_resp_code"] = np.random.choice([2, 3, 4], (http_mask & (df["label"] == 0)).sum(), p=[0.85, 0.1, 0.05])
df.loc[http_mask & (df["label"] == 1), "http_resp_code"] = np.random.choice([2, 4, 5], (http_mask & (df["label"] == 1)).sum(), p=[0.5, 0.3, 0.2])

# Content-Type risk (0=none, 1=safe, 2=risky like exe/zip)
df["http_content_risk"] = 0
df.loc[http_mask & (df["label"] == 0), "http_content_risk"] = np.random.choice([1, 2], (http_mask & (df["label"] == 0)).sum(), p=[0.95, 0.05])
df.loc[http_mask & (df["label"] == 1), "http_content_risk"] = np.random.choice([1, 2], (http_mask & (df["label"] == 1)).sum(), p=[0.4, 0.6])

# --- DNS Metadata (for DNS traffic) ---
dns_mask = df["dst_port"] == 53

# DNS query type (1=A, 2=AAAA, 3=MX, 4=TXT, 5=CNAME, 6=NS)
df["dns_query_type"] = 0
df.loc[dns_mask & (df["label"] == 0), "dns_query_type"] = np.random.choice([1, 2, 3], (dns_mask & (df["label"] == 0)).sum(), p=[0.7, 0.2, 0.1])
df.loc[dns_mask & (df["label"] == 1), "dns_query_type"] = np.random.choice([1, 4, 5], (dns_mask & (df["label"] == 1)).sum(), p=[0.3, 0.5, 0.2])  # TXT for tunneling

# Domain entropy (DGA detection) - higher = more random
df["dns_domain_entropy"] = np.random.uniform(2.5, 3.5, len(df))  # Normal baseline
df.loc[dns_mask & (df["label"] == 1), "dns_domain_entropy"] = np.random.uniform(3.8, 4.5, (dns_mask & (df["label"] == 1)).sum())  # DGA-like

# Domain label count (subdomain depth)
df["dns_label_count"] = np.random.choice([2, 3, 4], len(df), p=[0.5, 0.35, 0.15])  # Normal: 2-4 labels
df.loc[dns_mask & (df["label"] == 1), "dns_label_count"] = np.random.choice([4, 5, 6, 7], (dns_mask & (df["label"] == 1)).sum(), p=[0.2, 0.3, 0.3, 0.2])  # Deep subdomains

# DNS response size (larger = suspicious for data exfil)
df["dns_resp_size"] = np.random.normal(150, 40, len(df)).clip(50, 400)
df.loc[dns_mask & (df["label"] == 1), "dns_resp_size"] = np.random.uniform(400, 800, (dns_mask & (df["label"] == 1)).sum())

# --- TLS Metadata (for HTTPS/encrypted traffic) ---
tls_mask = df["dst_port"].isin([443, 8443, 993, 995, 587])

# JA3 fingerprint risk score (0=unknown, 1=known malware, 2=suspicious, 3=normal)
df["tls_ja3_risk"] = 0
df.loc[tls_mask & (df["label"] == 0), "tls_ja3_risk"] = np.random.choice([0, 3], (tls_mask & (df["label"] == 0)).sum(), p=[0.1, 0.9])
df.loc[tls_mask & (df["label"] == 1), "tls_ja3_risk"] = np.random.choice([0, 1, 2], (tls_mask & (df["label"] == 1)).sum(), p=[0.3, 0.4, 0.3])

# Certificate validity (0=invalid/expired, 1=self-signed, 2=valid CA)
df["tls_cert_valid"] = 0
df.loc[tls_mask & (df["label"] == 0), "tls_cert_valid"] = np.random.choice([1, 2], (tls_mask & (df["label"] == 0)).sum(), p=[0.05, 0.95])
df.loc[tls_mask & (df["label"] == 1), "tls_cert_valid"] = np.random.choice([0, 1, 2], (tls_mask & (df["label"] == 1)).sum(), p=[0.3, 0.4, 0.3])

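# Certificate age in days (negative = expired)
df["tls_cert_age"] = np.random.uniform(30, 365, len(df))  # Normal: 1 month to 1 year
# Attack flows: draw a fresh age per flow from one of three ranges
# (expired, very new, or moderately new), each picked with equal probability
n_attack_tls = (tls_mask & (df["label"] == 1)).sum()
cert_age_lows = np.array([-30, 1, 30])
cert_age_highs = np.array([0, 10, 100])
age_bucket = np.random.randint(0, 3, n_attack_tls)
df.loc[tls_mask & (df["label"] == 1), "tls_cert_age"] = np.random.uniform(cert_age_lows[age_bucket], cert_age_highs[age_bucket])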

# SNI mismatch (domain doesn't match cert)
df["tls_sni_mismatch"] = 0
df.loc[tls_mask & (df["label"] == 1), "tls_sni_mismatch"] = np.random.choice([0, 1], (tls_mask & (df["label"] == 1)).sum(), p=[0.6, 0.4])

# --- QUIC/HTTP3 Metadata (UDP 443) ---
quic_mask = (df["dst_port"] == 443) & (df["protocol"] == "UDP")

# QUIC version (0=unknown, 1=draft, 2=v1, 3=v2)
df["quic_version"] = 0
df.loc[quic_mask & (df["label"] == 0), "quic_version"] = np.random.choice([2, 3], (quic_mask & (df["label"] == 0)).sum(), p=[0.6, 0.4])
df.loc[quic_mask & (df["label"] == 1), "quic_version"] = np.random.choice([0, 1, 2], (quic_mask & (df["label"] == 1)).sum(), p=[0.4, 0.3, 0.3])

# QUIC 0-RTT usage (resumption, can be abused)
df["quic_0rtt"] = 0
df.loc[quic_mask & (df["label"] == 0), "quic_0rtt"] = np.random.choice([0, 1], (quic_mask & (df["label"] == 0)).sum(), p=[0.7, 0.3])
df.loc[quic_mask & (df["label"] == 1), "quic_0rtt"] = np.random.choice([0, 1], (quic_mask & (df["label"] == 1)).sum(), p=[0.3, 0.7])  # Attackers prefer 0-RTT

# QUIC connection migration count (IP/port changes during session)
df["quic_migration"] = 0
df.loc[quic_mask & (df["label"] == 0), "quic_migration"] = np.random.choice([0, 1], (quic_mask & (df["label"] == 0)).sum(), p=[0.9, 0.1])
df.loc[quic_mask & (df["label"] == 1), "quic_migration"] = np.random.choice([0, 1, 2, 3], (quic_mask & (df["label"] == 1)).sum(), p=[0.3, 0.3, 0.2, 0.2])

# --- gRPC Metadata (port 50051 or HTTP/2 on 443/8080) ---
grpc_mask = (df["app_id"] == "grpc") | (df["dst_port"] == 50051)

# gRPC method type (0=unknown, 1=unary, 2=server-stream, 3=client-stream, 4=bidirectional)
df["grpc_method_type"] = 0
df.loc[grpc_mask & (df["label"] == 0), "grpc_method_type"] = np.random.choice([1, 2], (grpc_mask & (df["label"] == 0)).sum(), p=[0.8, 0.2])
df.loc[grpc_mask & (df["label"] == 1), "grpc_method_type"] = np.random.choice([1, 3, 4], (grpc_mask & (df["label"] == 1)).sum(), p=[0.3, 0.3, 0.4])

# gRPC status code (0=OK, 1-16=errors, higher=suspicious)
df["grpc_status"] = 0
df.loc[grpc_mask & (df["label"] == 0), "grpc_status"] = np.random.choice([0, 1, 2], (grpc_mask & (df["label"] == 0)).sum(), p=[0.9, 0.05, 0.05])
df.loc[grpc_mask & (df["label"] == 1), "grpc_status"] = np.random.choice([0, 3, 7, 13], (grpc_mask & (df["label"] == 1)).sum(), p=[0.4, 0.2, 0.2, 0.2])

# gRPC message size anomaly (unusually large)
df["grpc_msg_size_anomaly"] = 0
df.loc[grpc_mask & (df["label"] == 1), "grpc_msg_size_anomaly"] = np.random.choice([0, 1], (grpc_mask & (df["label"] == 1)).sum(), p=[0.4, 0.6])

# --- WebSocket Metadata ---
ws_mask = df["app_id"] == "websocket"

# WebSocket frame frequency (frames/second; low, regular rates suggest a heartbeat/beacon)
df["ws_frame_freq"] = np.random.uniform(0.1, 2, len(df))  # Normal baseline
df.loc[ws_mask & (df["label"] == 0), "ws_frame_freq"] = np.random.uniform(0.5, 5, (ws_mask & (df["label"] == 0)).sum())
df.loc[ws_mask & (df["label"] == 1), "ws_frame_freq"] = np.random.uniform(0.01, 0.5, (ws_mask & (df["label"] == 1)).sum())  # Regular beacons

# WebSocket message entropy (random = encrypted/binary data)
df["ws_msg_entropy"] = np.random.uniform(3, 5, len(df))  # Normal baseline
df.loc[ws_mask & (df["label"] == 0), "ws_msg_entropy"] = np.random.uniform(3.5, 5.5, (ws_mask & (df["label"] == 0)).sum())
df.loc[ws_mask & (df["label"] == 1), "ws_msg_entropy"] = np.random.uniform(6, 7.5, (ws_mask & (df["label"] == 1)).sum())  # High entropy = encrypted C2

# WebSocket binary vs text ratio (binary more suspicious)
df["ws_binary_ratio"] = 0.0  # float default so the ratio assignments below keep a consistent dtype
df.loc[ws_mask & (df["label"] == 0), "ws_binary_ratio"] = np.random.uniform(0, 0.3, (ws_mask & (df["label"] == 0)).sum())
df.loc[ws_mask & (df["label"] == 1), "ws_binary_ratio"] = np.random.uniform(0.5, 1.0, (ws_mask & (df["label"] == 1)).sum())

# --- RMM (Remote Management) Metadata ---
rmm_apps = ["anydesk", "teamviewer", "connectwise", "splashtop", "screenconnect", "atera"]
rmm_mask = df["app_id"].isin(rmm_apps)

# RMM session duration risk (very long sessions are suspicious)
df["rmm_duration_risk"] = 0
normal_rmm = rmm_mask & (df["label"] == 0)
attack_rmm = rmm_mask & (df["label"] == 1)
df.loc[normal_rmm, "rmm_duration_risk"] = np.random.choice([0, 1], normal_rmm.sum(), p=[0.8, 0.2])
df.loc[attack_rmm, "rmm_duration_risk"] = np.random.choice([1, 2, 3], attack_rmm.sum(), p=[0.2, 0.4, 0.4])

# RMM source zone anomaly (external/guest using RMM = suspicious)
df["rmm_zone_anomaly"] = 0
df.loc[rmm_mask & (df["src_zone"].isin(["untrust", "guest"])), "rmm_zone_anomaly"] = 1
df.loc[rmm_mask & (df["src_zone"] == "untrust") & (df["dst_zone"] == "trust"), "rmm_zone_anomaly"] = 2

# RMM multi-target (connecting to multiple internal hosts)
df["rmm_multi_target"] = 0
df.loc[rmm_mask & (df["dst_ip_count"] > 1), "rmm_multi_target"] = 1
df.loc[rmm_mask & (df["dst_ip_count"] > 3), "rmm_multi_target"] = 2

# --- Application Layer Summary ---
print("🔍 Layer 7 (Application Layer) Features Generated:")
print(f"\n   HTTP Features (port 80/443/8080/8443):")
print(f"      • User-Agent score (0-3)")
print(f"      • HTTP method encoded")
print(f"      • Response code category")
print(f"      • Content-Type risk level")
print(f"\n   DNS Features (port 53):")
print(f"      • Query type (A/AAAA/MX/TXT/CNAME)")
print(f"      • Domain entropy (DGA detection)")
print(f"      • Subdomain depth")
print(f"      • Response size")
print(f"\n   TLS Features (port 443/8443/993/995/587):")
print(f"      • JA3 fingerprint risk")
print(f"      • Certificate validity")
print(f"      • Certificate age")
print(f"      • SNI mismatch flag")
print(f"\n   QUIC/HTTP3 Features (UDP 443):")
print(f"      • QUIC version")
print(f"      • 0-RTT resumption")
print(f"      • Connection migration count")
print(f"\n   gRPC Features (port 50051/HTTP2):")
print(f"      • Method type (unary/stream)")
print(f"      • Status code")
print(f"      • Message size anomaly")
print(f"\n   WebSocket Features:")
print(f"      • Frame frequency")
print(f"      • Message entropy")
print(f"      • Binary vs text ratio")
print(f"\n   RMM Features (AnyDesk, TeamViewer, etc.):")
print(f"      • Session duration risk")
print(f"      • Source zone anomaly")
print(f"      • Multi-target indicator")

print(f"\n📊 Layer 7 Feature Statistics:")
print(f"   HTTP traffic with L7 data: {http_mask.sum()} flows")
print(f"   DNS traffic with L7 data: {dns_mask.sum()} flows")
print(f"   TLS traffic with L7 data: {tls_mask.sum()} flows")
print(f"   QUIC traffic with L7 data: {quic_mask.sum()} flows")
print(f"   gRPC traffic with L7 data: {grpc_mask.sum()} flows")
print(f"   WebSocket traffic with L7 data: {ws_mask.sum()} flows")
print(f"   RMM traffic with L7 data: {rmm_mask.sum()} flows")
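
The cell above draws `dns_domain_entropy` from fixed random ranges rather than measuring it. If the raw query names were available in the logs, the entropy could be computed directly from the characters. The following is a minimal sketch of that idea; the helper name `shannon_entropy` and the two example names are illustrative assumptions, not part of the lab dataset.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical example names (not taken from the lab dataset)
for name in ["google", "x7f3k9qz2mvb1"]:
    print(f"{name}: {shannon_entropy(name):.2f} bits/char")
```

Names with many distinct, evenly used characters score higher, which is the property the simulated DGA range in this lab is meant to imitate.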

## 2. Feature Engineering (L3-L7)

In [None]:
# Engineer comprehensive network features
df["duration"] = df["duration"].clip(lower=0.001)

# ============================================================
# Traffic Volume Features
# ============================================================
df["total_bytes"] = df["bytes_sent"] + df["bytes_recv"]
df["total_packets"] = df["packets_sent"] + df["packets_recv"]

# ============================================================
# Rate Features (key for detecting high-volume attacks)
# ============================================================
df["bytes_per_second"] = df["total_bytes"] / df["duration"]
df["packets_per_second"] = df["total_packets"] / df["duration"]
df["bytes_per_packet"] = df["total_bytes"] / (df["total_packets"] + 1)

# ============================================================
# Ratio Features (asymmetric traffic is suspicious)
# ============================================================
df["bytes_ratio"] = df["bytes_sent"] / (df["total_bytes"] + 1)  # >0.5 = more sent than recv
df["packets_ratio"] = df["packets_sent"] / (df["total_packets"] + 1)
df["send_recv_ratio"] = (df["bytes_sent"] + 1) / (df["bytes_recv"] + 1)

# ============================================================
# Port Features
# ============================================================
WELL_KNOWN_PORTS = [80, 443, 22, 25, 53, 143, 993, 995, 587, 3306, 5432]
SUSPICIOUS_PORTS = [4444, 8888, 31337, 6667, 1337, 3333, 14444, 8333, 4443]

df["is_well_known_port"] = df["dst_port"].isin(WELL_KNOWN_PORTS).astype(int)
df["is_suspicious_port"] = df["dst_port"].isin(SUSPICIOUS_PORTS).astype(int)
df["is_high_port"] = (df["dst_port"] > 1024).astype(int)

# ============================================================
# Connection Pattern Features (fan-out detection)
# ============================================================
df["is_multi_dest"] = (df["dst_ip_count"] > 1).astype(int)  # Many destinations = scan
df["is_multi_src"] = (df["src_ip_count"] > 1).astype(int)  # Many sources = DDoS

# ============================================================
# Protocol Features
# ============================================================
df["is_tcp"] = (df["protocol"] == "TCP").astype(int)
df["is_udp"] = (df["protocol"] == "UDP").astype(int)

# ============================================================
# FIREWALL-SPECIFIC Features
# ============================================================
# Action encoding (0=allow, 1=alert, 2=deny, 3=drop)
action_map = {"allow": 0, "alert": 1, "deny": 2, "drop": 3}
df["action_code"] = df["action"].map(action_map)
df["is_blocked"] = df["action"].isin(["deny", "drop"]).astype(int)
df["is_alert"] = (df["action"] == "alert").astype(int)

# Zone encoding
zone_map = {"trust": 0, "dmz": 1, "guest": 2, "untrust": 3}
df["src_zone_code"] = df["src_zone"].map(zone_map)
df["dst_zone_code"] = df["dst_zone"].map(zone_map)
df["zone_transition"] = df["src_zone_code"] * 4 + df["dst_zone_code"]  # Unique zone pair
df["is_external_inbound"] = ((df["src_zone"] == "untrust") & (df["dst_zone"] != "untrust")).astype(int)
df["is_internal_lateral"] = ((df["src_zone"] == "trust") & (df["dst_zone"] == "trust")).astype(int)

# Threat category encoding
df["has_threat"] = (df["threat_category"] != "none").astype(int)

# Rule ID features
df["is_implicit_deny"] = (df["rule_id"] == 999).astype(int)

# ============================================================
# Log-transformed features (handle extreme values)
# ============================================================
df["log_bytes"] = np.log1p(df["total_bytes"])
df["log_packets"] = np.log1p(df["total_packets"])
df["log_duration"] = np.log1p(df["duration"])
df["log_bps"] = np.log1p(df["bytes_per_second"])
df["log_pps"] = np.log1p(df["packets_per_second"])

print("📊 Engineered Features Summary:")
# Counts below match the feature list selected for modeling in Section 3
print(f"   L3/L4 Network features: 16")
print(f"   L7 Application features: 24 (HTTP:4, DNS:4, TLS:4, QUIC:3, gRPC:3, WS:3, RMM:3)")
print(f"   Firewall features: 8")
print(f"   Total features: 48")

print("\n📈 Key L3/L4 Feature Statistics:")
key_features = ["bytes_per_second", "packets_per_second", "bytes_ratio", "duration"]
for feat in key_features:
    print(f"   {feat}: Normal={df[df['label']==0][feat].mean():.2f}, Attack={df[df['label']==1][feat].mean():.2f}")

print("\n🔍 Key L7 Feature Statistics:")
l7_features = ["dns_domain_entropy", "http_ua_score", "tls_ja3_risk"]
for feat in l7_features:
    print(f"   {feat}: Normal={df[df['label']==0][feat].mean():.2f}, Attack={df[df['label']==1][feat].mean():.2f}")

print("\n🔥 Firewall Feature Statistics:")
print(f"   Blocked sessions: {df['is_blocked'].sum()} ({100*df['is_blocked'].mean():.1f}%)")
print(f"   Alert sessions: {df['is_alert'].sum()} ({100*df['is_alert'].mean():.1f}%)")
print(f"   External inbound: {df['is_external_inbound'].sum()}")
print(f"   Internal lateral: {df['is_internal_lateral'].sum()}")
print(f"   Implicit deny hits: {df['is_implicit_deny'].sum()}")
print(f"   Threat detections: {df['has_threat'].sum()}")
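
One detail of the encoding step above: `Series.map` with a dict returns `NaN` for any value that is missing from the dict, and those `NaN` values would then flow into the feature matrix, which some estimators may reject. The sketch below shows one way to surface and handle such values on a toy frame. The `reset-both` action and the `-1` sentinel are assumptions made for illustration.

```python
import pandas as pd

action_map = {"allow": 0, "alert": 1, "deny": 2, "drop": 3}

# Toy frame; "reset-both" is a hypothetical action value missing from the map
toy = pd.DataFrame({"action": ["allow", "deny", "reset-both", "drop"]})

codes = toy["action"].map(action_map)

# Report which raw values did not map to a code
unmapped = toy.loc[codes.isna(), "action"].unique().tolist()
print("Unmapped actions:", unmapped)

# Reserve -1 for unknown values so the column stays integer-typed
toy["action_code"] = codes.fillna(-1).astype(int)
print(toy)
```

The same check could be applied to the zone mapping before the codes are combined into `zone_transition`.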

In [None]:
# Interactive feature distribution comparison with Plotly
features_to_plot = ["bytes_per_second", "packets_per_second", "bytes_ratio", "duration"]

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[f"{f} Distribution" for f in features_to_plot]
)

positions = [(1, 1), (1, 2), (2, 1), (2, 2)]

for feature, (row, col) in zip(features_to_plot, positions):
    # Normal traffic
    fig.add_trace(
        go.Histogram(
            x=df[df["label"] == 0][feature],
            name="Normal",
            opacity=0.6,
            marker_color="#2ecc71",
            legendgroup="normal",
            showlegend=(row == 1 and col == 1),
            histnorm="probability density",
        ),
        row=row, col=col
    )
    # Anomaly traffic
    fig.add_trace(
        go.Histogram(
            x=df[df["label"] == 1][feature],
            name="Anomaly",
            opacity=0.6,
            marker_color="#e74c3c",
            legendgroup="anomaly",
            showlegend=(row == 1 and col == 1),
            histnorm="probability density",
        ),
        row=row, col=col
    )

fig.update_layout(
    height=700,
    width=1000,
    template=PLOTLY_TEMPLATE,
    title_text="Feature Distributions: Normal vs Anomaly Traffic",
    barmode="overlay",
    legend=dict(orientation="h", yanchor="bottom", y=-0.12),
)
fig.show()

## 3. Prepare Features for Anomaly Detection

In [None]:
# Select comprehensive L3-L7 + Firewall features for anomaly detection
feature_cols = [
    # === L3/L4 FEATURES (Network/Transport Layer) ===
    # Rate features (most discriminative)
    "log_bps",
    "log_pps",
    "bytes_per_packet",
    # Ratio features (detect asymmetric traffic)
    "bytes_ratio",
    "packets_ratio",
    "send_recv_ratio",
    # Duration (detect long-running or burst attacks)
    "log_duration",
    # Volume features
    "log_bytes",
    "log_packets",
    # Port-based features
    "is_well_known_port",
    "is_suspicious_port",
    "is_high_port",
    # Connection pattern features
    "is_multi_dest",
    "is_multi_src",
    # Protocol features
    "is_tcp",
    "is_udp",

    # === L7 FEATURES (Application Layer - NGFW/DPI) ===
    # HTTP inspection
    "http_ua_score",
    "http_method",
    "http_resp_code",
    "http_content_risk",
    # DNS inspection (DGA/tunneling detection)
    "dns_query_type",
    "dns_domain_entropy",
    "dns_label_count",
    "dns_resp_size",
    # TLS inspection
    "tls_ja3_risk",
    "tls_cert_valid",
    "tls_cert_age",
    "tls_sni_mismatch",
    # QUIC/HTTP3 inspection
    "quic_version",
    "quic_0rtt",
    "quic_migration",
    # gRPC inspection
    "grpc_method_type",
    "grpc_status",
    "grpc_msg_size_anomaly",
    # WebSocket inspection
    "ws_frame_freq",
    "ws_msg_entropy",
    "ws_binary_ratio",
    # RMM inspection (AnyDesk, TeamViewer, etc.)
    "rmm_duration_risk",
    "rmm_zone_anomaly",
    "rmm_multi_target",

    # === FIREWALL-SPECIFIC FEATURES ===
    "action_code",
    "is_blocked",
    "is_alert",
    "zone_transition",
    "is_external_inbound",
    "is_internal_lateral",
    "has_threat",
    "is_implicit_deny",
]

X = df[feature_cols].values
y = df["label"].values
attack_types = df["attack_type"].values

# Use RobustScaler for outlier-robust scaling
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(f"📊 Feature Matrix:")
print(f"   Shape: {X_scaled.shape}")
print(f"   Total Features: {len(feature_cols)}")

l34_features = [f for f in feature_cols if not f.startswith(("http_", "dns_", "tls_", "quic_", "grpc_", "ws_", "rmm_", "action", "is_blocked", "is_alert", "zone", "is_external", "is_internal", "has_threat", "is_implicit"))]
l7_features = [f for f in feature_cols if f.startswith(("http_", "dns_", "tls_", "quic_", "grpc_", "ws_", "rmm_"))]
fw_features = [f for f in feature_cols if f.startswith(("action", "is_blocked", "is_alert", "zone", "is_external", "is_internal", "has_threat", "is_implicit"))]

print(f"\n📋 L3/L4 Network Features ({len(l34_features)}):")
for col in l34_features:
    print(f"   • {col}")

print(f"\n🔍 L7 Application Features ({len(l7_features)}):")
for col in l7_features:
    # Derive a display tag from the column prefix (e.g. "dns_" -> "DNS")
    proto = col.split("_")[0].upper()
    if proto == "WS":
        proto = "WEBSOCKET"
    print(f"   • [{proto}] {col}")

print(f"\n🔥 Firewall Features ({len(fw_features)}):")
for col in fw_features:
    print(f"   • {col}")
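
To make the choice of `RobustScaler` in the cell above concrete: with default settings it centers each feature on its median and divides by the interquartile range, so a single extreme flow does not compress the rest of the column. The sketch below checks this on a toy column; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature column with one extreme value (illustrative numbers)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print("Matches manual formula:", np.allclose(scaled, manual))
```

The outlier still ends up far from the others after scaling, but the four typical values keep a usable spread around zero.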

## 4. Isolation Forest

In [None]:
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected proportion of anomalies
    random_state=42,
)

# Predict: -1 for anomaly, 1 for normal
iso_pred = iso_forest.fit_predict(X_scaled)

# Convert to binary (1 for anomaly, 0 for normal)
iso_pred_binary = (iso_pred == -1).astype(int)

print("Isolation Forest Results:")
print(f"Predicted anomalies: {iso_pred_binary.sum()}")
print(f"Actual anomalies: {y.sum()}")

In [None]:
# Evaluate Isolation Forest
precision = precision_score(y, iso_pred_binary)
recall = recall_score(y, iso_pred_binary)
f1 = f1_score(y, iso_pred_binary)

print("Isolation Forest Metrics:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")

# Interactive confusion matrix with Plotly
cm = confusion_matrix(y, iso_pred_binary)
labels = ["Normal", "Anomaly"]

# Create text annotations with count and percentage
cm_text = [[f"{cm[i][j]}<br>({cm[i][j]/cm.sum()*100:.1f}%)" for j in range(2)] for i in range(2)]

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=labels,
    y=labels,
    text=cm_text,
    texttemplate="%{text}",
    colorscale="Blues",
    showscale=True,
    hovertemplate="Actual: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>",
))

fig.update_layout(
    title="Isolation Forest Confusion Matrix",
    xaxis_title="Predicted",
    yaxis_title="Actual",
    template=PLOTLY_TEMPLATE,
    width=500,
    height=450,
)
fig.show()
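
Precision, recall, and F1 above all depend on the cut-off implied by `contamination`. A threshold-independent view can be obtained by ranking flows by their raw anomaly score. The sketch below illustrates this with ROC AUC and average precision on synthetic stand-in data; the data, the variable names `X_demo` and `y_demo`, and the cluster positions are assumptions for illustration, not the lab dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-in data: 950 "normal" points near the origin, 50 far away
normal = rng.normal(0, 1, size=(950, 4))
outliers = rng.normal(6, 1, size=(50, 4))
X_demo = np.vstack([normal, outliers])
y_demo = np.array([0] * 950 + [1] * 50)

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
model.fit(X_demo)

# score_samples is higher for normal points, so negate it to get an anomaly score
scores = -model.score_samples(X_demo)

roc_auc = roc_auc_score(y_demo, scores)
avg_prec = average_precision_score(y_demo, scores)
print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_prec:.3f}")
```

In the lab itself, the equivalent call would pass `y` and `-iso_forest.score_samples(X_scaled)` to the same two metric functions.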

## 5. One-Class SVM

In [None]:
# Train One-Class SVM
# Note: for comparability with the other models, this cell fits on ALL traffic (normal + attack).
# Proper one-class learning would fit on normal traffic only.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)  # nu: upper bound on the fraction of training outliers

ocsvm_pred = ocsvm.fit_predict(X_scaled)
ocsvm_pred_binary = (ocsvm_pred == -1).astype(int)

print("One-Class SVM Results:")
print(f"Predicted anomalies: {ocsvm_pred_binary.sum()}")
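
The usual one-class setup fits the model on known-normal traffic only and then scores everything else against that boundary. The sketch below shows the pattern on synthetic stand-in data; the arrays `X_normal` and `X_attack`, the cluster positions, and the `nu=0.05` setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the lab dataset)
X_normal = rng.normal(0, 1, size=(500, 3))
X_attack = rng.normal(5, 1, size=(50, 3))

# Fit on normal traffic only, then score both groups
ocsvm_demo = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm_demo.fit(X_normal)

# predict returns -1 for outliers and 1 for inliers
pred_normal = (ocsvm_demo.predict(X_normal) == -1).astype(int)
pred_attack = (ocsvm_demo.predict(X_attack) == -1).astype(int)

print(f"Flagged normal: {pred_normal.mean():.1%}")
print(f"Flagged attack: {pred_attack.mean():.1%}")
```

Applied to this lab, that would mean fitting on `X_scaled[y == 0]` and predicting on `X_scaled`, which uses the labels only to select the training subset.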

## 6. Local Outlier Factor

In [None]:
# Train Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)

lof_pred = lof.fit_predict(X_scaled)
lof_pred_binary = (lof_pred == -1).astype(int)

print("LOF Results:")
print(f"Predicted anomalies: {lof_pred_binary.sum()}")
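
`fit_predict` as used above only labels the data the model was fitted on. To score flows that arrive later, scikit-learn's `LocalOutlierFactor` offers a `novelty=True` mode, in which `predict` is meant for new, unseen data. The sketch below uses synthetic stand-in data; the arrays and the position of the far-away points are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Synthetic stand-in data (not the lab dataset)
X_train = rng.normal(0, 1, size=(500, 3))
X_new = np.vstack([
    rng.normal(0, 1, size=(5, 3)),  # resemble the training data
    np.full((5, 3), 8.0),           # far from anything seen in training
])

# novelty=True enables predict() on unseen data
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train)

# 1 = flagged as outlier, 0 = treated as inlier
new_pred = (lof_novelty.predict(X_new) == -1).astype(int)
print(new_pred)
```

In novelty mode the model should not be used to re-score its own training rows, so it fits best with a clean baseline window of normal traffic.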

## 7. Compare All Models

In [None]:
# Compare all models
models = {
    "Isolation Forest": iso_pred_binary,
    "One-Class SVM": ocsvm_pred_binary,
    "Local Outlier Factor": lof_pred_binary,
}

results = []
for name, pred in models.items():
    results.append(
        {
            "Model": name,
            "Precision": precision_score(y, pred),
            "Recall": recall_score(y, pred),
            "F1": f1_score(y, pred),
        }
    )

results_df = pd.DataFrame(results)
print("Model Comparison:")
print(results_df.to_string(index=False))

# Interactive model comparison with Plotly
fig = go.Figure()

colors = {"Precision": "#3498db", "Recall": "#2ecc71", "F1": "#e74c3c"}

for metric in ["Precision", "Recall", "F1"]:
    fig.add_trace(go.Bar(
        name=metric,
        x=results_df["Model"],
        y=results_df[metric],
        marker_color=colors[metric],
        hovertemplate=f"<b>%{{x}}</b><br>{metric}: %{{y:.3f}}<extra></extra>",
    ))

fig.update_layout(
    title="Anomaly Detection Model Comparison",
    xaxis_title="Model",
    yaxis_title="Score",
    barmode="group",
    template=PLOTLY_TEMPLATE,
    height=450,
    width=800,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5),
)
fig.show()

In [None]:
# Attack Progression Timeline - visualize anomalies over simulated time
# Create synthetic timeline from flow data
df['flow_id'] = range(len(df))
df['simulated_time'] = pd.date_range(start='2024-01-15 00:00', periods=len(df), freq='1min')

# Get anomaly scores and predictions from the Isolation Forest fitted above
# (score_samples is higher for normal points, so negate it)
df['anomaly_score'] = -iso_forest.score_samples(X_scaled)
df['is_anomaly'] = iso_pred_binary == 1

fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=[
        'Network Traffic Volume Over Time',
        'Anomaly Score Timeline (Higher = More Suspicious)',
        'Attack Detection Events'
    ],
    vertical_spacing=0.1,
    row_heights=[0.3, 0.4, 0.3],
)

# Panel 1: Traffic volume (bytes)
fig.add_trace(
    go.Scatter(
        x=df['simulated_time'],
        y=df['total_bytes'],
        mode='lines',
        name='Bytes',
        line=dict(color='#3498db', width=1),
        opacity=0.7,
    ),
    row=1, col=1
)

# Panel 2: Anomaly scores with threshold
fig.add_trace(
    go.Scatter(
        x=df['simulated_time'],
        y=df['anomaly_score'],
        mode='lines',
        name='Anomaly Score',
        line=dict(color='#9b59b6', width=1),
    ),
    row=2, col=1
)

# Add threshold line
threshold = df[df['is_anomaly']]['anomaly_score'].min() if df['is_anomaly'].any() else df['anomaly_score'].quantile(0.95)
fig.add_hline(y=threshold, line_dash='dash', line_color='red',
              annotation_text='Detection Threshold', row=2, col=1)

# Panel 3: Detected attacks as events
anomalies = df[df['is_anomaly']]
if len(anomalies) > 0:
    # Color by attack type if available
    if 'attack_type' in anomalies.columns:
        for attack_type in anomalies['attack_type'].unique():
            attack_data = anomalies[anomalies['attack_type'] == attack_type]
            fig.add_trace(
                go.Scatter(
                    x=attack_data['simulated_time'],
                    y=[attack_type] * len(attack_data),
                    mode='markers',
                    name=attack_type,
                    marker=dict(size=10, symbol='x'),
                ),
                row=3, col=1
            )
    else:
        fig.add_trace(
            go.Scatter(
                x=anomalies['simulated_time'],
                y=anomalies['anomaly_score'],
                mode='markers',
                name='Detected Anomalies',
                marker=dict(color='#e74c3c', size=8, symbol='x'),
            ),
            row=3, col=1
        )

fig.update_layout(
    title='⏱️ Attack Progression Timeline',
    template=PLOTLY_TEMPLATE,
    height=700,
    showlegend=True,
    legend=dict(orientation='h', yanchor='bottom', y=1.02),
)

fig.update_xaxes(title_text='Time', row=3, col=1)
fig.update_yaxes(title_text='Bytes', row=1, col=1)
fig.update_yaxes(title_text='Anomaly Score', row=2, col=1)
fig.update_yaxes(title_text='Attack Type', row=3, col=1)

fig.show()

# Attack timeline summary
if len(anomalies) > 0:
    print('🚨 Attack Timeline Summary:')
    print(f'   First detection: {anomalies["simulated_time"].min()}')
    print(f'   Last detection: {anomalies["simulated_time"].max()}')
    print(f'   Total anomalies: {len(anomalies)}')
    if 'attack_type' in anomalies.columns:
        print('\n   Attacks by type:')
        for atype, count in anomalies['attack_type'].value_counts().items():
            print(f'      {atype}: {count}')


## 8. Anomaly Score Analysis

In [None]:
# Get anomaly scores from Isolation Forest
anomaly_scores = -iso_forest.score_samples(X_scaled)
df["anomaly_score"] = anomaly_scores

# Interactive anomaly score distribution with Plotly
threshold_90 = np.percentile(anomaly_scores, 90)

fig = go.Figure()

# Normal traffic histogram
fig.add_trace(go.Histogram(
    x=df[df["label"] == 0]["anomaly_score"],
    name="Normal",
    opacity=0.6,
    marker_color="#2ecc71",
    histnorm="probability density",
    hovertemplate="Score: %{x:.3f}<br>Density: %{y:.4f}<extra>Normal</extra>",
))

# Anomaly traffic histogram
fig.add_trace(go.Histogram(
    x=df[df["label"] == 1]["anomaly_score"],
    name="Anomaly",
    opacity=0.6,
    marker_color="#e74c3c",
    histnorm="probability density",
    hovertemplate="Score: %{x:.3f}<br>Density: %{y:.4f}<extra>Anomaly</extra>",
))

# 90th percentile threshold line
fig.add_vline(
    x=threshold_90,
    line_dash="dash",
    line_color="black",
    annotation_text=f"90th percentile ({threshold_90:.3f})",
    annotation_position="top right",
)

fig.update_layout(
    title="Anomaly Score Distribution (Isolation Forest)",
    xaxis_title="Anomaly Score",
    yaxis_title="Density",
    barmode="overlay",
    template=PLOTLY_TEMPLATE,
    height=450,
    width=900,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5),
)
fig.show()
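
The 90th-percentile line in the plot above is not arbitrary: with `contamination=0.1`, the Isolation Forest's own cut-off is derived from the same percentile of the training scores and stored, with opposite sign, in the fitted `offset_` attribute. The sketch below checks that relationship on synthetic stand-in data; `X_demo` and `forest` are illustrative names, not objects from the lab.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
X_demo = rng.normal(0, 1, size=(1000, 4))  # synthetic stand-in data

forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = forest.fit_predict(X_demo)

# Negate so that higher means more anomalous, as in the lab
scores = -forest.score_samples(X_demo)
threshold_90 = np.percentile(scores, 90)

print(f"90th percentile of scores: {threshold_90:.4f}")
print(f"-offset_:                  {-forest.offset_:.4f}")

# Compare flows above the percentile with flows the model labels -1
flagged_by_threshold = scores > threshold_90
flagged_by_model = labels == -1
print("Flagged by threshold:", flagged_by_threshold.sum())
print("Flagged by model:    ", flagged_by_model.sum())
```

Moving the threshold to a different percentile is therefore a way to explore other contamination levels without refitting the forest.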

In [None]:
# Show top anomalies with attack type classification
print("🚨 Top 15 Most Anomalous Flows:")
print("=" * 80)

top_anomalies = df.nlargest(15, "anomaly_score")[
    [
        "attack_type",
        "dst_port",
        "bytes_sent",
        "bytes_recv",
        "packets_sent",
        "duration",
        "anomaly_score",
        "label",
    ]
]
print(top_anomalies.to_string())

# Detection by attack type
print("\n\n📊 Detection Performance by Attack Type:")
print("=" * 80)

df["predicted"] = iso_pred_binary

for attack_type in df["attack_type"].unique():
    subset = df[df["attack_type"] == attack_type]
    if attack_type == "normal":
        # For normal, we want low false positive rate
        fp = subset["predicted"].sum()
        fp_rate = 100 * fp / len(subset)
        print(f"   {attack_type:18s}: {len(subset):4d} flows, False Positive Rate: {fp_rate:.1f}%")
    else:
        # For attacks, we want high detection rate
        detected = subset["predicted"].sum()
        detection_rate = 100 * detected / len(subset)
        print(
            f"   {attack_type:18s}: {len(subset):4d} flows, Detection Rate: {detection_rate:.1f}%"
        )

# Summary of which attacks are hardest to detect
print("\n\n🎯 Attack Detection Summary:")
hardest_to_detect = []
for attack_type in df[df["label"] == 1]["attack_type"].unique():
    subset = df[df["attack_type"] == attack_type]
    detected = subset["predicted"].sum()
    detection_rate = 100 * detected / len(subset)
    hardest_to_detect.append((attack_type, detection_rate, len(subset)))

hardest_to_detect.sort(key=lambda x: x[1])
print("   Attacks ranked by detection difficulty (hardest first):")
for attack, rate, count in hardest_to_detect:
    status = "🔴" if rate < 50 else "🟡" if rate < 80 else "🟢"
    print(f"   {status} {attack}: {rate:.1f}% ({count} samples)")
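
The per-attack loops above can also be expressed as a single `groupby`, which returns the same rates as a table that is easy to sort or export. The sketch below uses a toy frame; the attack type names `port_scan` and `c2_beacon` and all values are hypothetical, while the column names follow the lab's DataFrame.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the lab's DataFrame
toy = pd.DataFrame({
    "attack_type": ["normal", "normal", "normal", "normal",
                    "port_scan", "port_scan", "c2_beacon", "c2_beacon"],
    "predicted": [0, 0, 1, 0, 1, 1, 0, 1],
})

# For "normal" the flagged rate is a false positive rate;
# for every other type it is a detection rate
summary = (
    toy.groupby("attack_type")["predicted"]
    .agg(flows="count", flagged="sum", flagged_rate="mean")
    .sort_values("flagged_rate")
)
summary["flagged_rate"] = (100 * summary["flagged_rate"]).round(1)
print(summary)
```

Sorting by `flagged_rate` puts the hardest-to-detect attack types at the top once the `normal` row is excluded.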

## Summary

In this lab, we built a **Next-Gen Firewall (NGFW) anomaly detection system** using firewall traffic logs enriched with simulated Layer 7 deep packet inspection (DPI) metadata for **7 application protocols**, including RMM tool abuse detection.

### Firewall Traffic Log Features

Unlike NetFlow, firewall logs provide:
- **Actions**: allow, deny, drop, alert
- **Zones**: trust, untrust, dmz, guest
- **Threat Categories**: scan, brute-force, C2, flood, remote-access-trojan, etc.
- **Application IDs**: ssl, dns, ssh, quic, grpc, websocket, **anydesk, teamviewer**, etc.
- **URL Categories**: business, malware, cryptocurrency, remote-access, etc.

### Dataset Characteristics
- **Normal Traffic**: Web, Email, DNS, SSH, Database, API, QUIC, gRPC, WebSocket, **RMM**
- **Attack Traffic**: 12 attack types including **RMM abuse (T1219)** - used by LAPSUS$, Conti, BlackCat
- **Zone Transitions**: Trust→Untrust, Untrust→Trust, Internal lateral

### Feature Engineering (48 Features)

| Layer | Features | Coverage |
|-------|----------|----------|
| **L3/L4** | 16 | Rate, ratio, duration, volume, port, connection pattern, protocol |
| **L7 (DPI)** | 24 | HTTP, DNS, TLS, QUIC, gRPC, WebSocket, **RMM** |
| **Firewall** | 8 | Action, zones, threat, rule hits |

### L7 Protocol Coverage

| Protocol | Normal Use | Attack Abuse | Key Features |
|----------|------------|--------------|--------------|
| **HTTP/1.1** | Web browsing | C2 beacon, exfil | UA score, method, response |
| **DNS** | Name resolution | Tunneling, DGA | Entropy, query type, size |
| **TLS** | Encryption | Suspicious certs | JA3, cert age, SNI |
| **QUIC/HTTP3** | Google, CDNs | Encrypted C2 tunnel | Version, 0-RTT, migration |
| **gRPC** | Microservices | API abuse, injection | Method type, status, size |
| **WebSocket** | Real-time apps | Persistent C2 | Frame freq, entropy, binary |
| **RMM Tools** | IT admin | **Living-off-the-land** | Duration, zone, multi-target |

### RMM Abuse Detection (T1219)

| Tool | Legitimate Use | Attack Indicators |
|------|----------------|-------------------|
| **AnyDesk** | Remote support | External→Internal, long duration |
| **TeamViewer** | Remote access | Guest zone, multi-target |
| **ConnectWise** | MSP management | Off-hours, non-IT source |
| **Splashtop** | Screen sharing | High data transfer |
| **ScreenConnect** | Help desk | Untrust zone origination |

### MITRE ATT&CK Coverage
| Attack Type | Technique | Key Indicators |
|-------------|-----------|----------------|
| Port Scan | T1046 | implicit_deny, multi_dest |
| Brute Force | T1110 | threat=brute-force |
| C2 Beacon | T1071 | suspicious UA, low frame_freq |
| Data Exfil | T1048 | high bytes_sent, binary_ratio |
| DNS Tunnel | T1071.004 | high entropy, TXT queries |
| DDoS | T1498 | threat=flood, multi_src |
| Lateral Move | T1021.002 | internal_lateral, SMB |
| Cryptomining | T1496 | threat=cryptocurrency |
| QUIC C2 | T1572 | unknown version, high 0-RTT |
| gRPC Abuse | T1190 | error status, large messages |
| WebSocket C2 | T1071.001 | high entropy, low freq, binary |
| **RMM Abuse** | **T1219** | **zone_anomaly, duration_risk, multi_target** |

### Key Takeaways
1. **RMM tools are the new LOLBins** - legitimate tools abused for persistence/C2
2. **Zone-based detection is critical** - RMM from untrust zone is highly suspicious
3. **Duration matters** - 8-hour RMM session from guest zone = red flag
4. **Multi-target RMM = lateral movement** - legitimate IT usually single-target
5. **48 features, 24 of them L7 features across 7 protocols**, combine network, application, and firewall context

### Next Steps
1. Add **RMM behavioral baseline** per user/department
2. Implement **time-of-day anomaly** detection for RMM usage
3. Add **gRPC reflection** detection for service enumeration
4. Build **RMM process correlation** with endpoint telemetry