🔍 DPI Network Intelligence Platform

AI-Powered Deep Packet Inspection · Real-Time Traffic Analysis · Cybersecurity ML

A production-grade AI/cybersecurity platform that combines a C++17 multi-threaded DPI engine with an ML pipeline for real-time traffic classification, unsupervised anomaly detection, SHAP explainability, and an LLM-powered network analyst — validated on 2.8M real network flows from the CIC-IDS-2017 benchmark.

🏆 98.95% Accuracy	🔬 2.83M Flows	🚨 0.625 ROC-AUC	⚡ 4 Worker Threads	🤖 LLM Analyst
CIC-IDS-2017 RF	Real Benchmark	Isolation Forest	LB → FP Pipeline	Groq + Llama 3.1

📸 Screenshots

📊 Overview: total flows, packets, bytes, top apps, protocol split
🎯 Model Performance: CIC-IDS-2017 benchmark, 98.95% accuracy, per-class F1
🚨 Anomaly Detector: Isolation Forest, 100% Heartbleed and 91.7% Infiltration detection
🗂️ Traffic Map: filterable live flow table with SNI, ports, bytes, duration
🔴 Anomaly Detection: score distribution, anomalies by app, 5% anomaly rate
🤖 AI Analyst: natural language queries via Groq + Llama 3.1 70B

📋 Table of Contents

1. What is DPI?
2. Networking Background
3. Project Overview
4. System Architecture
5. File Structure
6. The Journey of a Packet — Simple Version
7. The Journey of a Packet — Multi-threaded Version
8. Deep Dive: Each Component
9. How SNI Extraction Works
10. How Blocking Works
11. ML Pipeline
12. Performance & Benchmark Results
13. Dashboard & Demo
14. Building and Running
15. Understanding the Output
16. Future Improvements

⚠️ Repository Notes

Dataset & Model Files

Large datasets, generated flow files, PCAP captures, and trained ML model artifacts are intentionally excluded from this repository to keep the project lightweight and GitHub-friendly.

Excluded assets include:

CIC-IDS-2017 raw datasets
Generated CSV flow exports
PCAP capture files
Large .pkl / .joblib trained models

This repository focuses on:

Source code
System architecture
ML pipeline implementation
Dashboard components
Documentation and reproducibility

Reproducing Results

To reproduce the benchmark results locally:

# Generate test traffic
python generate_test_pcap.py

# Run DPI engine
./dpi_engine test_dpi.pcap output.pcap

# Train classifier
python ml/train_classifier.py

# Run anomaly detection
python ml/anomaly_detector.py

# Launch dashboard
streamlit run dashboard/app.py

CIC-IDS-2017 Dataset

The project was evaluated using the public CIC-IDS-2017 benchmark dataset:

https://www.unb.ca/cic/datasets/ids-2017.html

After downloading the dataset:

python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv

Deployment Status

Current repository focus:

High-performance DPI engine
ML traffic analysis
Explainable AI
Dashboard visualization
AI traffic analyst

Planned future deployment support:

Docker containerization
docker-compose orchestration
Kubernetes scaling
Cloud-native deployment
Kafka streaming pipeline

1. What is DPI?

Deep Packet Inspection (DPI) is a technology used to examine the contents of network packets as they pass through a checkpoint. Unlike simple firewalls that only look at packet headers (source/destination IP), DPI looks inside the packet payload.

Real-World Uses

Industry	Use Case
ISPs	Throttle or block certain applications (e.g., BitTorrent)
Enterprises	Block social media on office networks
Parental Controls	Block inappropriate websites
Security	Detect malware or intrusion attempts

What This Platform Does

User Traffic (PCAP) ──► [C++ DPI Engine] ──► Filtered Traffic (PCAP)
                               │
                               ▼
                    ┌─────────────────────┐
                    │  Flow CSV Export    │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
    [RF Classifier]   [Isolation Forest]   [SHAP Explainer]
    Traffic ID         Anomaly Detection    Why flagged?
              │                │                │
              └────────────────┼────────────────┘
                               ▼
                    [Streamlit Dashboard]
                    [LLM AI Analyst]

2. Networking Background

The Network Stack

When you visit a website, data travels through multiple layers:

┌─────────────────────────────────────────────────────────┐
│ Layer 7: Application    │ HTTP, TLS, DNS               │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Transport      │ TCP (reliable), UDP (fast)   │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Network        │ IP addresses (routing)       │
├─────────────────────────────────────────────────────────┤
│ Layer 2: Data Link      │ MAC addresses (local network)│
└─────────────────────────────────────────────────────────┘

A Packet's Structure

Every network packet is like a Russian nesting doll — headers wrapped inside headers:

┌──────────────────────────────────────────────────────────────────┐
│ Ethernet Header (14 bytes)                                       │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ IP Header (20 bytes)                                         │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ TCP Header (20 bytes)                                    │ │ │
│ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Payload (Application Data)                           │ │ │ │
│ │ │ │ e.g., TLS Client Hello with SNI                      │ │ │ │
│ │ │ └──────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

The Five-Tuple

A connection (or "flow") is uniquely identified by 5 values:

Field	Example	Purpose
Source IP	192.168.1.100	Who is sending
Destination IP	172.217.14.206	Where it's going
Source Port	54321	Sender's application identifier
Destination Port	443	Service being accessed (443 = HTTPS)
Protocol	TCP (6)	TCP or UDP

All packets with the same 5-tuple belong to the same connection. This is how the engine tracks conversations and applies flow-level blocking.
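
As a rough illustration of the idea, the sketch below groups packets into flows keyed by their 5-tuple. It is written in Python for brevity rather than in the engine's C++, and `FiveTuple` and `group_by_flow` are hypothetical names, not identifiers from this repository.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical flow key mirroring the five fields in the table above."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IANA protocol number: 6 = TCP, 17 = UDP


def group_by_flow(packets):
    """Group (FiveTuple, packet_length) pairs into per-flow packet/byte counts."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, length in packets:
        flows[key]["packets"] += 1
        flows[key]["bytes"] += length
    return dict(flows)


web = FiveTuple("192.168.1.100", "172.217.14.206", 54321, 443, 6)
dns = FiveTuple("192.168.1.100", "8.8.8.8", 40000, 53, 17)
flows = group_by_flow([(web, 517), (dns, 60), (web, 1400)])
print(flows[web])  # {'packets': 2, 'bytes': 1917}
```

Because the key is an immutable tuple, it can be used directly as a dictionary key, which is the Python analogue of the hash-map lookup the engine performs per packet.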

What is SNI?

Server Name Indication (SNI) is part of the TLS/HTTPS handshake. When you visit https://www.youtube.com:

Your browser sends a "Client Hello" message
This message includes the domain name in plaintext (not encrypted yet!)
The server uses this to know which certificate to send

TLS Client Hello:
├── Version: TLS 1.2
├── Random: [32 bytes]
├── Cipher Suites: [list]
└── Extensions:
    └── SNI Extension:
        └── Server Name: "www.youtube.com"  ← Extracted here!

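This is the key to DPI: even though HTTPS is encrypted, the domain name is visible in the Client Hello, the first payload-carrying packet the client sends after the TCP handshake (unless the client uses Encrypted Client Hello).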

3. Project Overview

Three C++ Engine Variants

Version	File	Use Case
Simple (Single-threaded)	`src/main_working.cpp`	Learning, small captures
Multi-threaded (pipeline engine)	`src/main_dpi.cpp`	Main production-style DPI engine
Multi-threaded (self-contained)	`src/dpi_mt.cpp`	Compact standalone variant

Full Platform Stack

Layer	Technology	Purpose
Packet Engine	C++17, pthreads	Parse, classify, block traffic
Flow Export	CSV via `FlowExporter`	Bridge C++ → Python
ML Classifier	Random Forest (sklearn)	13-class traffic identification
Anomaly Detector	Isolation Forest (sklearn)	Unsupervised zero-day detection
Explainability	SHAP	Why was a flow flagged?
Dashboard	Streamlit + Plotly	Live visual analytics
AI Analyst	Groq + Llama 3.1 70B	Natural language traffic queries
Benchmark	CIC-IDS-2017	2.83M real-world flow validation

4. System Architecture

ASCII Architecture Diagram

 ┌─────────────────────────────────────────────────────────────────────┐
 │                    DPI NETWORK INTELLIGENCE PLATFORM                │
 └─────────────────────────────────────────────────────────────────────┘

  [PCAP Input File / Live Interface]
           │
           ▼
  ┌─────────────────┐
  │  PCAP Reader    │  Reads packet bytes + timestamps
  │  Thread         │
  └────────┬────────┘
           │  RawPacket structs
           ▼
  ┌─────────────────────────────────────┐
  │         LOAD BALANCER POOL          │
  │  ┌──────────┐     ┌──────────┐      │
  │  │   LB-0   │     │   LB-1   │      │  Hash(5-tuple) % N
  │  └────┬─────┘     └────┬─────┘      │  ensures same flow →
  └───────┼────────────────┼────────────┘  same FP thread
          │                │
     ┌────┘     ┌──────────┘
     ▼          ▼
  ┌──────────────────────────────────────────────────┐
  │              FAST PATH THREAD POOL               │
  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐ │
  │  │  FP-0  │  │  FP-1  │  │  FP-2  │  │  FP-3  │ │
  │  │        │  │        │  │        │  │        │ │
  │  │ Parse  │  │ Parse  │  │ Parse  │  │ Parse  │ │
  │  │Classify│  │Classify│  │Classify│  │Classify│ │
  │  │ Block  │  │ Block  │  │ Block  │  │ Block  │ │
  │  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘ │
  └──────┼───────────┼───────────┼────────────┼──────┘
         │           │           │            │
         └───────────┴─────┬─────┴────────────┘
                           │
              ┌────────────┴───────────┐
              │                        │
              ▼                        ▼
   ┌─────────────────┐      ┌─────────────────────┐
   │  Output Queue   │      │   Flow CSV Export   │
   │  (filtered      │      │   flows.csv         │
   │   packets)      │      │   (per flow stats)  │
   └────────┬────────┘      └──────────┬──────────┘
            │                          │
            ▼                          ▼
   ┌─────────────────┐      ┌──────────────────────────────────────────┐
   │  output.pcap    │      │           PYTHON ML PIPELINE             │
   └─────────────────┘      │                                          │
                            │  ┌──────────────┐  ┌──────────────────┐ │
                            │  │ Random Forest│  │ Isolation Forest │ │
                            │  │ Classifier   │  │ Anomaly Detector │ │
                            │  │ 98.95% acc   │  │ 0.625 ROC-AUC   │ │
                            │  └──────┬───────┘  └────────┬─────────┘ │
                            │         │                    │           │
                            │         └─────────┬──────────┘           │
                            │                   ▼                      │
                            │         ┌──────────────────┐             │
                            │         │ SHAP Explainer   │             │
                            │         │ (Why flagged?)   │             │
                            │         └──────────────────┘             │
                            └──────────────────┬───────────────────────┘
                                               │
                                               ▼
                            ┌──────────────────────────────────────────┐
                            │         STREAMLIT DASHBOARD              │
                            │  Overview │ Traffic Map │ Anomalies      │
                            │  App Breakdown │ 🤖 AI Analyst           │
                            │  (Groq + Llama 3.1 70B)                 │
                            └──────────────────────────────────────────┘

Mermaid Architecture Diagram

flowchart TD
    A[📁 PCAP File / Live Interface] --> B[PCAP Reader Thread]
    B --> C{Load Balancer Pool\nHash 5-tuple % N}
    C --> D[LB-0]
    C --> E[LB-1]
    D --> F[FP-0]
    D --> G[FP-1]
    E --> H[FP-2]
    E --> I[FP-3]
    F & G & H & I --> J[Output Queue]
    F & G & H & I --> K[Flow CSV Export\nflows.csv]
    J --> L[📄 output.pcap]
    K --> M[🌲 Random Forest\nClassifier\n98.95% acc]
    K --> N[🌳 Isolation Forest\nAnomaly Detector\n0.625 AUC]
    M --> O[SHAP Explainer]
    N --> O
    O --> P[📊 Streamlit Dashboard]
    P --> Q[🤖 AI Analyst\nGroq + Llama 3.1]

    style A fill:#1e3a5f,color:#fff
    style P fill:#1e3a5f,color:#fff
    style Q fill:#4a1a5f,color:#fff
    style M fill:#1a3a1f,color:#fff
    style N fill:#1a3a1f,color:#fff

5. File Structure

Packet_analyzer/
│
├── 📁 include/                     # C++ Header files
│   ├── pcap_reader.h               # PCAP file reading
│   ├── packet_parser.h             # Network protocol parsing
│   ├── sni_extractor.h             # TLS/HTTP deep inspection
│   ├── types.h                     # Core data structures
│   ├── flow_exporter.h             # CSV flow export (Step 2)
│   ├── rule_manager.h              # Blocking rules
│   ├── connection_tracker.h        # Stateful flow tracking
│   ├── load_balancer.h             # LB thread management
│   ├── fast_path.h                 # FP thread processing
│   └── dpi_engine.h                # Main orchestrator
│
├── 📁 src/                         # C++ Implementation
│   ├── pcap_reader.cpp
│   ├── packet_parser.cpp
│   ├── sni_extractor.cpp
│   ├── flow_exporter.cpp           # Thread-safe CSV export
│   ├── types.cpp
│   ├── main_working.cpp            # ★ Simple version
│   ├── main_dpi.cpp                # ★ Main multi-threaded pipeline
│   └── dpi_mt.cpp                  # ★ Self-contained multi-threaded variant
│
├── 📁 ml/                          # Python ML Pipeline
│   ├── train_classifier.py         # Random Forest training
│   ├── predict.py                  # Single-flow prediction
│   ├── anomaly_detector.py         # Isolation Forest
│   ├── explain_anomaly.py          # SHAP explainability
│   ├── preprocess_cicids.py        # CIC-IDS-2017 preprocessor
│   ├── evaluate_cicids.py          # Benchmark evaluation
│   ├── evaluate_anomaly_cicids.py  # Anomaly benchmark
│   └── models/                     # Saved .pkl models
│       ├── traffic_classifier.pkl
│       ├── anomaly_detector.pkl
│       └── confusion_matrix.png
│
├── 📁 dashboard/                   # Streamlit Dashboard
│   ├── app.py                      # 5-page dashboard
│   └── llm_analyst.py              # Groq AI analyst
│
├── 📁 tests/                       # Google Test unit tests
│   ├── test_packet_parser.cpp
│   └── test_sni_extractor.cpp
│
├── 📁 data/                        # Dataset (not tracked)
│   └── cicids2017/                 # CIC-IDS-2017 CSVs
│
├── requirements.txt                # Python dependencies
├── generate_test_pcap.py           # Synthetic PCAP generator
├── flows.csv                       # Generated flow data (not tracked)
├── flows_with_anomalies.csv        # Enriched with anomaly scores (not tracked)
├── test_dpi.pcap                   # Sample capture (generated locally, not tracked)
├── CMakeLists.txt                  # CMake build config
└── README.md

6. The Journey of a Packet — Simple Version

Tracing a single packet through main_working.cpp:

Step 1: Read PCAP File

PcapReader reader;
reader.open("capture.pcap");

PCAP File Format:

┌────────────────────────────┐
│ Global Header (24 bytes)   │  ← Read once at start
├────────────────────────────┤
│ Packet Header (16 bytes)   │  ← Timestamp, length
│ Packet Data (variable)     │  ← Actual network bytes
├────────────────────────────┤
│ Packet Header (16 bytes)   │
│ Packet Data (variable)     │
└────────────────────────────┘

Step 2: Parse Protocol Headers

PacketParser::parse(raw, parsed);

raw.data bytes (assuming a 20-byte IPv4 header and a 20-byte TCP header, with no options):
[0-13]   Ethernet Header  → parsed.src_mac, parsed.dest_mac, ether_type
[14-33]  IPv4 Header      → parsed.src_ip, parsed.dest_ip, parsed.protocol
[34-53]  TCP Header       → parsed.src_port, parsed.dest_port, parsed.tcp_flags
[54+]    Payload          → TLS ClientHello, HTTP headers, etc.

Step 3: Extract SNI

auto sni = SNIExtractor::extract(payload, payload_len);
// Returns: std::optional<std::string>
// Example: "www.youtube.com"

Step 4: Classify Application

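// extract() returns std::optional, so check it before dereferencing.
// AppType::UNKNOWN is assumed here as the fallback enumerator.
AppType app = sni ? sniToAppType(*sni) : AppType::UNKNOWN;
// "www.youtube.com" → AppType::YOUTUBE
// "www.facebook.com" → AppType::FACEBOOK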
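
To make the classification step concrete, here is a minimal Python sketch of an SNI-to-application lookup. The suffix table and the name `sni_to_app_type` are illustrative assumptions; the actual mapping lives in the C++ sources and may use different rules.

```python
from typing import Optional

# Hypothetical suffix table for illustration only.
APP_BY_SUFFIX = {
    "youtube.com": "YOUTUBE",
    "facebook.com": "FACEBOOK",
    "google.com": "GOOGLE",
}


def sni_to_app_type(sni: Optional[str]) -> str:
    """Map an SNI hostname to an app label, or UNKNOWN when nothing matches."""
    if not sni:
        return "UNKNOWN"
    host = sni.lower().rstrip(".")
    for suffix, app in APP_BY_SUFFIX.items():
        # Match the domain itself or any subdomain, but not "notyoutube.com".
        if host == suffix or host.endswith("." + suffix):
            return app
    return "UNKNOWN"


print(sni_to_app_type("www.youtube.com"))  # YOUTUBE
```

Matching on a dot boundary rather than a plain substring avoids classifying look-alike domains as the real service.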

Step 5: Apply Rules + Forward/Drop

if (rules.isBlocked(src_ip, app, sni)) {
    stats.dropped++;
    continue;   // Don't write to output
}
output.write(packet);
stats.forwarded++;

7. The Journey of a Packet — Multi-threaded Version

Thread Architecture

Main Thread
    │
    ├── Spawn: LB Thread 0 ──► [TSQueue] ──► FP Thread 0
    │                     └──► [TSQueue] ──► FP Thread 1
    │
    ├── Spawn: LB Thread 1 ──► [TSQueue] ──► FP Thread 2
    │                     └──► [TSQueue] ──► FP Thread 3
    │
    ├── Spawn: Output Thread (writes filtered packets)
    │
    └── Reader Loop: reads packets → dispatches to LBs

Why Hash-Based Load Balancing?

size_t lb_idx = FiveTupleHash()(pkt.tuple) % num_lbs;
lbs_[lb_idx]->queue().push(pkt);

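The key insight: all packets travelling in one direction of a TCP connection share the same 5-tuple. Hashing the 5-tuple at each dispatch stage therefore sends them to the same LB and, from there, to the same FP thread. (Reply packets have source and destination swapped, so an engine that wants both directions on one thread has to normalise the tuple before hashing.) This means:

✅ No race conditions on flow state, because each flow is owned by exactly one FP thread
✅ No mutexes needed per flow
✅ Throughput that scales with thread count as long as flows hash evenly across threads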
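
The dispatch rule can be sketched in a few lines. This is a Python illustration under stated assumptions rather than the engine's C++ code: `canonical_key` and `pick_worker` are hypothetical names, CRC32 stands in for whatever `FiveTupleHash` actually computes, and the endpoint-ordering step (so that both directions of a connection map to one worker) is an assumption, not something the source confirms.

```python
import zlib


def canonical_key(src_ip, dst_ip, src_port, dst_port, protocol) -> bytes:
    """Order the two endpoints so both directions of a connection share one key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    return f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{protocol}".encode()


def pick_worker(five_tuple, num_workers: int) -> int:
    """Deterministically map a flow to one worker index in [0, num_workers)."""
    return zlib.crc32(canonical_key(*five_tuple)) % num_workers


forward = ("192.168.1.100", "172.217.14.206", 54321, 443, 6)
reverse = ("172.217.14.206", "192.168.1.100", 443, 54321, 6)
# Both directions land on the same worker because the key is order-independent.
print(pick_worker(forward, 4) == pick_worker(reverse, 4))  # True
```

Since the mapping depends only on the flow key, no coordination between workers is needed to keep a flow's state in one place.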

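Thread-Safe Queue (Mutex + Condition Variables)

The queue is not lock-free: a mutex guards the underlying container, and two condition variables block the producer when the queue is full and the consumer when it is empty.

Producer (LB Thread)              Consumer (FP Thread)
        │                                 │
        ▼                                 ▼
   mutex.lock()                      mutex.lock()
   not_full.wait()                   not_empty.wait()
   (while queue full)                (while queue empty)
   queue.push(packet)                packet = queue.front()
   mutex.unlock()                    queue.pop()
   not_empty.notify()                mutex.unlock()
                                     not_full.notify()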
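
A bounded blocking queue of this kind can be sketched with Python's `threading.Condition`. This is a minimal illustration in the spirit of the `TSQueue` shown in the thread diagram, not a port of it; `BoundedQueue` and its method names are hypothetical.

```python
import threading
from collections import deque


class BoundedQueue:
    """Blocking FIFO guarded by one mutex and two condition variables."""

    def __init__(self, capacity: int):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def push(self, item) -> None:
        with self._not_full:                      # acquires the shared lock
            while len(self._items) >= self._capacity:
                self._not_full.wait()             # releases the lock while waiting
            self._items.append(item)
            self._not_empty.notify()

    def pop(self, timeout: float = 0.1):
        """Return the next item, or None if nothing arrives within `timeout` seconds."""
        with self._not_empty:
            if not self._not_empty.wait_for(lambda: len(self._items) > 0, timeout=timeout):
                return None
            item = self._items.popleft()
            self._not_full.notify()
            return item


q = BoundedQueue(capacity=2)
producer = threading.Thread(target=lambda: [q.push(i) for i in range(5)])
producer.start()
received = [q.pop(timeout=1.0) for _ in range(5)]
producer.join()
print(received)  # [0, 1, 2, 3, 4]
```

The timed `pop` mirrors the 100 ms timeout in the fast path loop: a worker that receives `None` can check its shutdown flag instead of blocking forever.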

Fast Path Processing Loop

Each FP thread runs this loop independently (pseudocode):

while (running) {
    pkt = input_queue.pop(timeout = 100ms);

    if (!pkt) continue;                   // timeout, check if still running

    flow = flows_[pkt.tuple];             // reference to this flow's state, O(1) average hash lookup

    if (!flow.classified) {
        try SNI extraction;               // TLS, port 443
        try HTTP Host extraction;         // HTTP, port 80
        try DNS detection;                // port 53
    }

    if (!flow.blocked)
        flow.blocked = rules.check(src_ip, app, sni);

    if (flow.blocked) {
        stats.dropped++;
    } else {
        output_queue.push(pkt);
        stats.forwarded++;
    }

    if (TCP FIN or RST) {
        exportFlow(flow);                 // write to flows.csv
        flows_.erase(pkt.tuple);          // cleanup
    }
}

8. Deep Dive: Each Component

PcapReader

Reads binary PCAP files in two steps:

Global Header (24 bytes): Magic number (validates file), version, timestamp precision, max packet size, link type
Per-Packet: 16-byte header (timestamps + lengths) followed by raw packet data

struct PcapGlobalHeader {
    uint32_t magic_number;    // 0xa1b2c3d4 = valid PCAP
    uint16_t version_major;   // usually 2
    uint16_t version_minor;   // usually 4
    int32_t  thiszone;        // timezone (usually 0)
    uint32_t sigfigs;         // timestamp accuracy (usually 0)
    uint32_t snaplen;         // max packet size (usually 65535)
    uint32_t network;         // link type (1 = Ethernet)
};
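
For readers who want to poke at a capture without building the engine, the same layout can be decoded with Python's `struct` module. This is a sketch under assumptions: `parse_global_header` and `iter_packets` are hypothetical helper names, and only the classic microsecond-resolution magic number shown above is handled.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic PCAP, microsecond timestamps


def parse_global_header(data: bytes) -> dict:
    """Decode the 24-byte global header, detecting byte order from the magic."""
    if len(data) < 24:
        raise ValueError("PCAP global header needs 24 bytes")
    for endian in ("<", ">"):
        (magic,) = struct.unpack(endian + "I", data[:4])
        if magic == PCAP_MAGIC:
            break
    else:
        raise ValueError("not a classic microsecond PCAP file")
    major, minor, thiszone, sigfigs, snaplen, network = struct.unpack(
        endian + "HHiIII", data[4:24]
    )
    return {"endian": endian, "version": (major, minor),
            "snaplen": snaplen, "network": network}


def iter_packets(data: bytes, endian: str):
    """Yield (ts_sec, ts_usec, packet_bytes) for each 16-byte record header + data."""
    offset = 24
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16
        if offset + incl_len > len(data):
            break  # truncated capture: stop rather than return a partial packet
        yield ts_sec, ts_usec, data[offset:offset + incl_len]
        offset += incl_len


# Build a tiny in-memory capture with one 3-byte packet to exercise the parser.
sample = (
    struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    + struct.pack("<IIII", 1, 2, 3, 3) + b"abc"
)
header = parse_global_header(sample)
packets = list(iter_packets(sample, header["endian"]))
print(header["network"], packets)  # 1 [(1, 2, b'abc')]
```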

PacketParser

Parses raw bytes into structured data with manual bit manipulation:

// Extract IP version and header length from first byte
uint8_t ip_byte  = data[14];
uint8_t version  = (ip_byte >> 4) & 0x0F;   // top 4 bits
uint8_t ihl      = ip_byte & 0x0F;           // bottom 4 bits
size_t  ip_hdr_len = ihl * 4;                // IHL is in 32-bit words
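
The same bit manipulation is shown below in Python, extended to locate the TCP header and payload using the IHL and TCP data-offset fields instead of fixed offsets. `parse_ipv4_tcp` is a hypothetical helper written for this illustration; it is not the repository's `PacketParser`.

```python
import socket
import struct


def parse_ipv4_tcp(frame: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port, payload) or None if not IPv4/TCP."""
    if len(frame) < 14 + 20:
        return None
    (ether_type,) = struct.unpack("!H", frame[12:14])
    if ether_type != 0x0800:                 # not IPv4
        return None
    version = frame[14] >> 4                 # top 4 bits
    ip_hdr_len = (frame[14] & 0x0F) * 4      # IHL is in 32-bit words
    if version != 4 or ip_hdr_len < 20:
        return None
    if frame[14 + 9] != 6:                   # IP protocol field: 6 = TCP
        return None
    src_ip = socket.inet_ntoa(frame[26:30])
    dst_ip = socket.inet_ntoa(frame[30:34])
    tcp_start = 14 + ip_hdr_len
    if len(frame) < tcp_start + 20:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[tcp_start:tcp_start + 4])
    tcp_hdr_len = (frame[tcp_start + 12] >> 4) * 4   # data offset, 32-bit words
    if tcp_hdr_len < 20:
        return None
    return src_ip, dst_ip, src_port, dst_port, frame[tcp_start + tcp_hdr_len:]


# Build a minimal Ethernet + IPv4 + TCP frame to exercise the parser.
ethernet = bytes(12) + struct.pack("!H", 0x0800)
ipv4 = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 45, 0, 0, 64, 6, 0,
                   socket.inet_aton("192.168.1.100"),
                   socket.inet_aton("172.217.14.206"))
tcp = struct.pack("!HHIIBBHHH", 54321, 443, 0, 0, 0x50, 0x18, 65535, 0, 0)
parsed = parse_ipv4_tcp(ethernet + ipv4 + tcp + b"hello")
print(parsed)  # ('192.168.1.100', '172.217.14.206', 54321, 443, b'hello')
```

Honouring both length fields matters in practice: IP options and TCP options (common on SYN packets) shift the payload, and a parser that assumes byte 54 would misread it.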

SNIExtractor

Manually parses TLS binary format byte-by-byte:

TLS Record [byte 0]:     0x16 = Content Type: Handshake
TLS Record [bytes 1-2]:  0x03 0x01 = Version (TLS 1.0 compat)
TLS Record [bytes 3-4]:  Length of handshake data

Handshake [byte 5]:      0x01 = Client Hello
Handshake [bytes 6-8]:   3-byte length

Client Hello [bytes 9-10]:  Client version
Client Hello [bytes 11-42]: Random (32 bytes)
Client Hello [byte 43]:     Session ID length
... (cipher suites, compression) ...
Extensions length (2 bytes)
  Extension type 0x0000 = SNI
  Extension data length (2 bytes)
    SNI list length (2 bytes)
      SNI entry type 0x00 = hostname
      SNI name length (2 bytes)
      SNI name bytes ← THIS IS THE DOMAIN NAME
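
The byte layout above translates fairly directly into code. The following Python sketch walks the same fields; `extract_sni` and `build_client_hello` are hypothetical helpers for illustration, and the parser assumes the whole ClientHello sits in a single TLS record at the start of the payload.

```python
import struct
from typing import Optional


def extract_sni(payload: bytes) -> Optional[str]:
    """Return the SNI hostname from a TLS ClientHello record, or None."""
    try:
        if payload[0] != 0x16 or payload[5] != 0x01:   # handshake record, ClientHello
            return None
        pos = 43                                        # session ID length byte
        pos += 1 + payload[pos]                         # skip session ID
        (cipher_len,) = struct.unpack("!H", payload[pos:pos + 2])
        pos += 2 + cipher_len                           # skip cipher suites
        pos += 1 + payload[pos]                         # skip compression methods
        (ext_total,) = struct.unpack("!H", payload[pos:pos + 2])
        pos += 2
        end = min(pos + ext_total, len(payload))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack("!HH", payload[pos:pos + 4])
            pos += 4
            if ext_type == 0x0000:                      # server_name extension
                # Layout: 2-byte list length, 1-byte entry type, 2-byte name length
                name_type = payload[pos + 2]
                (name_len,) = struct.unpack("!H", payload[pos + 3:pos + 5])
                name = payload[pos + 5:pos + 5 + name_len]
                if name_type != 0 or len(name) != name_len:
                    return None
                return name.decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None                                     # truncated or malformed input
    return None


def build_client_hello(host: str) -> bytes:
    """Assemble a minimal ClientHello carrying only an SNI extension (test helper)."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    extensions = struct.pack("!HH", 0x0000, len(ext_data)) + ext_data
    body = (
        b"\x03\x03" + bytes(32)                         # client version + random
        + b"\x00"                                       # empty session ID
        + struct.pack("!H", 2) + b"\x13\x01"            # one cipher suite
        + b"\x01\x00"                                   # one compression method (null)
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


print(extract_sni(build_client_hello("www.youtube.com")))  # www.youtube.com
```

Every length field comes from untrusted network input, so the sketch treats any out-of-range read as "no SNI" instead of raising.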

FlowExporter

Thread-safe CSV writer that captures per-flow statistics:

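void exportFlow(const Connection& conn) {
    // Compute derived metrics. Cast before dividing so the average is not
    // truncated by integer division, and guard against zero-length flows
    // (a single packet, or first and last packet in the same millisecond).
    uint64_t duration_ms   = last_seen_ms - first_seen_ms;
    double   seconds       = duration_ms / 1000.0;
    double   avg_pkt_size  = total_packets ? static_cast<double>(total_bytes) / total_packets : 0.0;
    double   pkts_per_sec  = seconds > 0.0 ? total_packets / seconds : 0.0;
    double   bytes_per_sec = seconds > 0.0 ? total_bytes   / seconds : 0.0;

    // Thread-safe write
    std::lock_guard<std::mutex> lock(mutex_);
    file_ << row_string << "\n";
}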
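
The derived metrics are simple enough to check by hand. The Python sketch below computes them with a guard for zero-duration flows and writes rows under a lock; `FlowCsvWriter` and its column order are assumptions for illustration and may not match the columns the C++ `FlowExporter` emits.

```python
import csv
import io
import threading


class FlowCsvWriter:
    """Hypothetical thread-safe CSV writer for per-flow statistics."""

    FIELDS = ["duration_ms", "total_packets", "total_bytes",
              "avg_packet_size", "packets_per_second", "bytes_per_second"]

    def __init__(self, stream):
        self._lock = threading.Lock()
        self._writer = csv.DictWriter(stream, fieldnames=self.FIELDS)
        self._writer.writeheader()

    @staticmethod
    def derive(first_seen_ms, last_seen_ms, total_packets, total_bytes):
        """Compute per-flow rates, returning 0.0 where a denominator is zero."""
        duration_ms = last_seen_ms - first_seen_ms
        seconds = duration_ms / 1000.0
        return {
            "duration_ms": duration_ms,
            "total_packets": total_packets,
            "total_bytes": total_bytes,
            "avg_packet_size": total_bytes / total_packets if total_packets else 0.0,
            "packets_per_second": total_packets / seconds if seconds > 0 else 0.0,
            "bytes_per_second": total_bytes / seconds if seconds > 0 else 0.0,
        }

    def export(self, **flow):
        row = self.derive(**flow)
        with self._lock:                 # one writer at a time
            self._writer.writerow(row)
        return row


buffer = io.StringIO()
writer = FlowCsvWriter(buffer)
row = writer.export(first_seen_ms=1000, last_seen_ms=6000,
                    total_packets=20, total_bytes=30000)
single = writer.export(first_seen_ms=5, last_seen_ms=5,
                       total_packets=1, total_bytes=60)
print(row["avg_packet_size"], row["packets_per_second"], row["bytes_per_second"])
# 1500.0 4.0 6000.0
```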

9. How SNI Extraction Works

Your browser                    YouTube Server
     │                               │
     │──── TLS Client Hello ────────►│
     │     ┌───────────────────────┐ │
     │     │ SNI: www.youtube.com  │ │  ← DPI reads this
     │     │ (PLAINTEXT!)          │ │
     │     └───────────────────────┘ │
     │◄─── TLS Server Hello ─────────│
     │     (certificate + params)    │
     │                               │
     │════ Encrypted from here ══════│

Why SNI is plaintext: The server needs to know which domain you're connecting to BEFORE encryption is established — to pick the right TLS certificate. This is by design in the TLS protocol.

What we extract:

Traffic Type	Method	Field Extracted
HTTPS (port 443)	TLS ClientHello parsing	SNI hostname
HTTP (port 80)	HTTP header parsing	Host: header
DNS (port 53)	UDP payload detection	Classified as DNS
Other	Port-based heuristic	App type guess
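
For the plaintext HTTP row of the table, the extraction is a header scan. A minimal Python sketch follows; `extract_http_host` is a hypothetical name, and the method list is an assumption about which requests are worth inspecting.

```python
from typing import Optional

HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ", b"PATCH ")


def extract_http_host(payload: bytes) -> Optional[str]:
    """Return the Host header value from a plaintext HTTP/1.x request, if present."""
    if not payload.startswith(HTTP_METHODS):
        return None
    head = payload.split(b"\r\n\r\n", 1)[0]       # headers only, ignore any body
    for line in head.split(b"\r\n")[1:]:          # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":   # header names are case-insensitive
            return value.strip().decode("ascii", errors="replace")
    return None


request = b"GET /watch HTTP/1.1\r\nUser-Agent: demo\r\nhOsT: www.example.com\r\n\r\n"
print(extract_http_host(request))  # www.example.com
```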

10. How Blocking Works

Decision Tree

Packet arrives
      │
      ▼
┌─────────────────────────────────┐
│ Is source IP in blocked list?  │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
┌─────────────────────────────────┐
│ Is app type in blocked list?   │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
┌─────────────────────────────────┐
│ Does SNI match blocked domain? │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
            FORWARD
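
The decision tree maps onto three ordered checks. The sketch below is a Python illustration, not the repository's `rule_manager`: `RuleSet` is a hypothetical class, and substring matching for domains is an assumption inferred from the `--block-domain facebook` example in the run instructions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RuleSet:
    """Hypothetical container for the three rule lists in the decision tree."""
    blocked_ips: Set[str] = field(default_factory=set)
    blocked_apps: Set[str] = field(default_factory=set)
    blocked_domains: Set[str] = field(default_factory=set)

    def is_blocked(self, src_ip: str, app: str, sni: Optional[str]) -> Optional[str]:
        """Return the reason a packet is dropped, or None to forward it."""
        if src_ip in self.blocked_ips:
            return "ip"
        if app in self.blocked_apps:
            return "app"
        # Assumed semantics: a rule matches if it appears anywhere in the SNI.
        if sni and any(domain in sni.lower() for domain in self.blocked_domains):
            return "domain"
        return None


rules = RuleSet(blocked_ips={"192.168.1.50"},
                blocked_apps={"YOUTUBE"},
                blocked_domains={"facebook"})
print(rules.is_blocked("10.0.0.1", "HTTPS", "www.facebook.com"))  # domain
```

Returning the matching rule type rather than a bare boolean makes it easy to report why a flow was dropped.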

Flow-Based Blocking

Connection to YouTube:
  Packet 1 (SYN)           → No SNI yet, FORWARD
  Packet 2 (SYN-ACK)       → No SNI yet, FORWARD
  Packet 3 (ACK)           → No SNI yet, FORWARD
  Packet 4 (Client Hello)  → SNI: www.youtube.com
                           → App: YOUTUBE (blocked!)
                           → Mark flow as BLOCKED
                           → DROP this packet
  Packet 5 (Data)          → Flow is BLOCKED → DROP
  Packet 6 (Data)          → Flow is BLOCKED → DROP

We can't identify the app until we see the Client Hello. Once identified, all future packets of that flow are dropped automatically.
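
The sticky per-flow verdict can be simulated in a few lines of Python. `process_flow` and the packet dictionaries are hypothetical constructs for illustration; the real engine keeps this state in its per-thread flow table.

```python
def process_flow(packets, blocked_apps):
    """Replay one flow's packets and return a FORWARD/DROP verdict per packet.

    Each packet is a dict with an optional "app" key that is set only on the
    packet where classification succeeds (for example the TLS Client Hello).
    """
    flow_blocked = False
    verdicts = []
    for pkt in packets:
        if not flow_blocked and pkt.get("app") in blocked_apps:
            flow_blocked = True            # sticky: applies to all later packets
        verdicts.append("DROP" if flow_blocked else "FORWARD")
    return verdicts


youtube_flow = [
    {"name": "SYN"}, {"name": "SYN-ACK"}, {"name": "ACK"},
    {"name": "Client Hello", "app": "YOUTUBE"},
    {"name": "Data"}, {"name": "Data"},
]
verdicts = process_flow(youtube_flow, blocked_apps={"YOUTUBE"})
print(verdicts)
# ['FORWARD', 'FORWARD', 'FORWARD', 'DROP', 'DROP', 'DROP']
```

Note that the three handshake packets are already forwarded by the time the flow is classified, which matches the packet trace above.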

11. ML Pipeline

Overview

flows.csv (from C++ engine)
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│                  FEATURE ENGINEERING                    │
│  log transforms · protocol encoding · port categories  │
│  bytes/pkt ratio · pps · bps · duration                │
└────────────────────────┬────────────────────────────────┘
                         │
          ┌──────────────┴──────────────┐
          ▼                             ▼
┌──────────────────┐          ┌──────────────────────┐
│  Random Forest   │          │  Isolation Forest    │
│  Classifier      │          │  Anomaly Detector    │
│                  │          │                      │
│  Supervised      │          │  Unsupervised        │
│  13 classes      │          │  No attack labels    │
│  98.95% accuracy │          │  0.625 ROC-AUC       │
└────────┬─────────┘          └──────────┬───────────┘
         │                               │
         └──────────────┬────────────────┘
                        ▼
               ┌──────────────────┐
               │  SHAP Explainer  │
               │  "Why flagged?"  │
               └──────────────────┘

Features Used

Feature	Description	Why Important
`protocol_enc`	TCP=0, UDP=1, ICMP=2	Protocol type is discriminative
`dst_port`	Destination port number	Port 443/80/53 are strong signals
`duration_ms`	Flow duration in ms	Attack flows are often unusually short or unusually long
`total_packets`	Packet count	DoS/DDoS have extreme counts
`total_bytes`	Byte count	Streaming vs scanning patterns
`avg_packet_size`	Bytes per packet	Small = scan, Large = stream
`packets_per_second`	Flow rate	BruteForce has high pps
`bytes_per_second`	Bandwidth	DDoS has extreme bps
`log_*` variants	Log-transformed versions	Compress skewed distributions
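
As a sketch of how these features might be derived from one exported flow record, consider the function below. It uses only the standard library; `engineer_features` is a hypothetical name, and the choice of which columns get a `log_` variant is an assumption, since the table does not enumerate them.

```python
import math

PROTOCOL_CODES = {"TCP": 0, "UDP": 1, "ICMP": 2}


def engineer_features(flow: dict) -> dict:
    """Turn one raw flow record into the numeric features listed above."""
    seconds = flow["duration_ms"] / 1000.0
    packets, total_bytes = flow["total_packets"], flow["total_bytes"]
    features = {
        "protocol_enc": PROTOCOL_CODES[flow["protocol"]],
        "dst_port": flow["dst_port"],
        "duration_ms": flow["duration_ms"],
        "total_packets": packets,
        "total_bytes": total_bytes,
        "avg_packet_size": total_bytes / packets if packets else 0.0,
        "packets_per_second": packets / seconds if seconds > 0 else 0.0,
        "bytes_per_second": total_bytes / seconds if seconds > 0 else 0.0,
    }
    # log1p keeps zero-valued inputs finite while compressing heavy tails.
    for name in ("duration_ms", "total_packets", "total_bytes", "bytes_per_second"):
        features["log_" + name] = math.log1p(features[name])
    return features


feats = engineer_features({"protocol": "TCP", "dst_port": 443,
                           "duration_ms": 5000, "total_packets": 20,
                           "total_bytes": 30000})
print(feats["avg_packet_size"], feats["packets_per_second"])  # 1500.0 4.0
```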

Running the ML Pipeline

# Step 1: Generate training data
./dpi_engine test_dpi.pcap output.pcap

# Step 2: Train the classifier
python ml/train_classifier.py

# Step 3: Run anomaly detection
python ml/anomaly_detector.py

# Step 4: Explain anomalies with SHAP
python ml/explain_anomaly.py --top 5

# Step 5: Classify a single flow
python ml/predict.py --dst-port 443 --packets 20 --bytes 30000 --duration 5000

# Step 6: CIC-IDS-2017 benchmark (needs dataset)
python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv

12. Performance & Benchmark Results

Validated on CIC-IDS-2017 — a public benchmark dataset containing 2,830,743 real network flows with labelled benign and attack traffic.

Traffic Classifier (Supervised — Random Forest)

Metric	Score
Test Accuracy	98.95%
Macro F1 Score	84.73%
Weighted F1	99%
5-Fold CV Macro-F1	81.07% ± 2.38%
Training Flows	2,264,594
Test Flows	566,149
Classes	13

Per-Class Performance:

Class	Precision	Recall	F1	Support
DDoS	100%	100%	100%	25,606
DNS	100%	100%	100%	191,563
PortScan	99%	100%	100%	31,786
HTTPS	100%	99%	99%	213,002
DoS	99%	99%	99%	50,532
SSH	99%	99%	99%	2,160
SMTP	98%	99%	98%	756
HTTP	99%	94%	96%	47,139
BruteForce	90%	92%	91%	3,068
Heartbleed	100%	100%	100%	2
Infiltration	100%	71%	83%	7
Botnet	16%	93%	27%*	393
WebAttack	5%	78%	9%*	135

* Low precision due to severe class imbalance. Botnet and WebAttack each make up less than 0.1% of the test flows (393 and 135 of 566,149). This is a known limitation; SMOTE oversampling is listed under Future Improvements as the planned fix.
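
The F1 column follows directly from the precision and recall columns, which also explains why Botnet and WebAttack score so low despite high recall. The short Python check below recomputes a few rows and approximates the macro average from the rounded table values; it is an arithmetic illustration, not output from the evaluation scripts.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) pairs copied from the per-class table, as fractions.
table = {
    "Botnet": (0.16, 0.93),
    "WebAttack": (0.05, 0.78),
    "Infiltration": (1.00, 0.71),
    "HTTP": (0.99, 0.94),
}
f1_scores = {name: round(f1(p, r), 2) for name, (p, r) in table.items()}
print(f1_scores)
# {'Botnet': 0.27, 'WebAttack': 0.09, 'Infiltration': 0.83, 'HTTP': 0.96}

# Macro-F1 is the unweighted mean over all 13 classes (rounded table values, in %).
reported_f1 = [100, 100, 100, 99, 99, 99, 98, 96, 91, 100, 83, 27, 9]
macro_f1 = sum(reported_f1) / len(reported_f1)
print(round(macro_f1, 2))  # 84.69
```

The small gap between 84.69 and the reported 84.73% comes from using rounded per-class values. Because the macro average weights every class equally, the two rare classes pull it far below the 98.95% accuracy.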

Anomaly Detector (Unsupervised — Isolation Forest)

Trained with zero attack labels. Evaluated post-hoc against ground-truth CIC-IDS-2017 labels.

Metric	Score
ROC-AUC	0.6254
Detection Rate (Recall)	21.2%
False Alarm Rate	19.7%
Precision	20.8%
Attack Rate in Dataset	19.7%
Contamination Parameter	0.20
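
These metrics are tied together by simple arithmetic, which gives a quick sanity check: the recall, false alarm rate, and attack rate together determine what fraction of flows is flagged and what precision results. The Python sketch below is an illustration using the rounded values from the table; the function name is hypothetical.

```python
def flagged_rate_and_precision(recall, false_alarm_rate, attack_rate):
    """Derive the fraction of flows flagged and the precision from three rates."""
    true_pos = recall * attack_rate                      # attacks that were flagged
    false_pos = false_alarm_rate * (1.0 - attack_rate)   # benign flows that were flagged
    flagged = true_pos + false_pos
    return flagged, true_pos / flagged


flagged, precision = flagged_rate_and_precision(
    recall=0.212, false_alarm_rate=0.197, attack_rate=0.197
)
print(round(flagged, 3), round(precision, 3))  # 0.2 0.209
```

The flagged fraction comes out at about 20%, matching the contamination parameter, and the derived precision agrees with the reported 20.8% to within rounding. It also shows that attacks are flagged only slightly more often than benign flows (21.2% versus 19.7%), which is consistent with an ROC-AUC not far above 0.5.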

Per-Attack Detection Rate:

Attack Type	Detection Rate	Notes
Heartbleed	100%	Highly anomalous packet structure
Infiltration	91.7%	Rare + unusual behavior pattern
BruteForce	46.6%	High packet rate detectable
DDoS	46.1%	Volume anomaly detectable
DoS	20.0%	Blends with high-traffic flows
PortScan	0.7%	Designed to mimic normal TCP
Botnet	3.0%	Specifically evades detection
WebAttack	2.7%	Looks like normal HTTP traffic

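The Two-Layer Security Story:

Supervised (RF)   → Known attacks   → 98.95% accuracy  ← "We know what to look for"
Unsupervised (IF) → Unknown attacks → 0.625 ROC-AUC    ← "A weak but label-free signal for new threats"

This mirrors real-world IDS architectures: a supervised layer for known attack signatures and an anomaly layer for zero-day threats. Note that an ROC-AUC of 0.625 is only modestly above the 0.5 chance level, and the 21.2% recall is close to the 19.7% false alarm rate, so the anomaly layer currently works best on structurally unusual attacks such as Heartbleed and Infiltration.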

13. Dashboard & Demo

Screenshots

Page	Description
Overview	Total flows, packets, bytes, blocked %, top apps bar chart, protocol split
Model Performance	CIC-IDS-2017 metrics, per-class F1 chart, confusion table
Anomaly Detector	Detection rates by attack type, score distribution
Traffic Map	Filterable flow table, top source IPs
Anomaly Detection	Score histogram, scatter plot, top anomalous flows
App Breakdown	Traffic share pie, bandwidth pie, duration box plots
AI Analyst	Natural language queries powered by Groq + Llama 3.1 70B

AI Analyst Sample Queries

"What was the most suspicious traffic in the last 10 minutes?"
"Which IPs were blocked and why?"
"Is there any sign of a port scan or DDoS?"
"Which app generated the most bandwidth?"
"Summarise all anomalies found."

Live Demo Mode

Use generated/test captures and rerun the pipeline to refresh dashboard inputs:

python generate_test_pcap.py
./dpi_engine test_dpi.pcap output.pcap
python ml/anomaly_detector.py
streamlit run dashboard/app.py

14. Building and Running

Prerequisites

Windows: MSYS2 with g++ | macOS/Linux: g++ or clang++
CMake (used by the build commands below)
Python 3.11+ for the ML pipeline
No external C++ libraries required for the engine itself

All commands run from the project root (Packet_analyzer/).

Build C++ Engine

cmake -S . -B build
cmake --build build --config Release

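Generated binaries include packet_analyzer, dpi_working, dpi_engine, and dpi_mt. Depending on your CMake generator they are typically placed under build/, so adjust the ./dpi_engine path in the commands below if the binary is not in the current directory.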

Run the Engine

# Basic (main multi-threaded pipeline)
./dpi_engine test_dpi.pcap output.pcap

# Block a single app
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube

# Block multiple apps + IP + domain
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube --block-app TikTok --block-ip 192.168.1.50 --block-domain facebook

# Configure thread count
./dpi_engine input.pcap output.pcap --lbs 4 --fps 4

Python ML Pipeline

pip install -r requirements.txt

# Train classifier
python ml/train_classifier.py

# Run anomaly detection
python ml/anomaly_detector.py

# Launch dashboard
streamlit run dashboard/app.py

Environment Setup

Create a .env file in the project root (never commit this):

GROQ_API_KEY=gsk_your_key_here

Get a free Groq API key at console.groq.com.

Generate Test Data

python generate_test_pcap.py

GitHub Repository Notes

The repository intentionally excludes large datasets, generated artifacts, trained model binaries, and PCAP captures to comply with GitHub storage limits and keep cloning and setup fast.

Recommended workflow:

git clone <repo>
pip install -r requirements.txt
cmake -S . -B build
cmake --build build --config Release

Then run the pipeline locally using your own datasets or generated traffic.

15. Understanding the Output

Engine Terminal Output

╔══════════════════════════════════════════════════════════════╗
║              DPI ENGINE v2.0 (Multi-threaded)                ║
╠══════════════════════════════════════════════════════════════╣
║ Load Balancers:  2    FPs per LB:  2    Total FPs:  4        ║
╚══════════════════════════════════════════════════════════════╝

[Rules] Blocked app: YouTube
[Rules] Blocked IP: 192.168.1.50
[Reader] Processing packets...
[Reader] Done reading 77 packets

╔══════════════════════════════════════════════════════════════╗
║                      PROCESSING REPORT                       ║
╠══════════════════════════════════════════════════════════════╣
║ Total Packets:                77                             ║
║ Total Bytes:                5738                             ║
║ TCP Packets:                  73                             ║
║ UDP Packets:                   4                             ║
╠══════════════════════════════════════════════════════════════╣
║ Forwarded:                    69                             ║
║ Dropped:                       8                             ║
╠══════════════════════════════════════════════════════════════╣
║ THREAD STATISTICS                                            ║
║   LB0 dispatched:             53                             ║
║   FP0 processed:              53                             ║
║   FP3 processed:              24                             ║
╠══════════════════════════════════════════════════════════════╣
║                   APPLICATION BREAKDOWN                      ║
╠══════════════════════════════════════════════════════════════╣
║ HTTPS        39  50.6% ##########                            ║
║ YouTube       4   5.2% # (BLOCKED)                           ║
║ DNS           4   5.2% #                                     ║
╚══════════════════════════════════════════════════════════════╝

[Detected Domains/SNIs]
  - www.youtube.com -> YouTube (BLOCKED)
  - www.facebook.com -> Facebook
  - www.google.com -> Google

Output Reference

Section	Meaning
Configuration	Thread pool size
Rules	Active blocking rules
Total Packets	Read from input PCAP
Forwarded	Written to output PCAP
Dropped	Blocked by rules
Thread Statistics	Work distribution across LB/FP threads
Application Breakdown	Traffic classification by app
Detected SNIs	Domain names extracted from TLS handshakes

16. Future Improvements

Near-Term (High ROI)

Item	Description
SMOTE Oversampling	Address Botnet/WebAttack class imbalance, with a target Macro-F1 of 92%+
Live libpcap	Capture from real network interface instead of PCAP files
Kubernetes Scaling	Horizontal scaling with k8s deployments for high-throughput environments
Grafana Integration	Replace Streamlit with production Grafana + InfluxDB dashboards

Medium-Term

Item	Description
Kafka Streaming	Replace file-based flow export with real-time Kafka topic pipeline
SIEM Integration	Export alerts to Splunk / Elastic SIEM via CEF/syslog format
QUIC/HTTP3 Support	Detect traffic on UDP port 443 with QUIC Initial packet parsing
Distributed Processing	Multi-node packet processing with shared flow state via Redis

Long-Term (Research)

Item	Description
Transformer Traffic Analysis	Replace Random Forest with FlowTransformer / ET-BERT for sequence modeling
Online Learning	Incremental model updates without full retraining (River ML)
GPU Acceleration	CUDA-based packet parsing for 10Gbps+ line rate processing
Federated Learning	Train across multiple network nodes without centralising raw traffic
Cloud Deployment	AWS/GCP managed deployment with auto-scaling and CloudWatch metrics

Summary

This platform demonstrates a complete AI/cybersecurity engineering stack:

Skill	Implementation
Systems Programming	C++17 multi-threaded packet engine, manual binary parsing
Network Protocols	Ethernet/IP/TCP/UDP/TLS parsing, SNI extraction
Concurrent Programming	Thread pools, thread-safe queues (mutex + condition variables), atomic counters
ML Engineering	Feature engineering, Random Forest, Isolation Forest, cross-validation
AI Explainability	SHAP values for anomaly explanation
LLM Integration	Groq API + Llama 3.1 70B for natural language analytics
Data Engineering	C++ → CSV → Python bridge, real-world benchmark evaluation
Full-Stack	Streamlit dashboard with 5 interactive pages
DevOps	CMake-based build workflow
Research Validation	CIC-IDS-2017 benchmark, 2.83M flows, reported metrics

The key insight: even HTTPS traffic leaks the destination domain in the TLS handshake, enabling identification and control of application usage — and that signal, combined with flow-level statistics, is powerful enough to train production-grade ML models.

Built with C++17 · Python 3.11 · scikit-learn · Streamlit · Groq

If this project helped you, consider giving it a ⭐
