
rajadityaaa/dpi-network-intelligence-platform


🔍 DPI Network Intelligence Platform

AI-Powered Deep Packet Inspection · Real-Time Traffic Analysis · Cybersecurity ML

C++17 · Python · Streamlit · scikit-learn · SHAP · Groq LLM · Cybersecurity · Multithreading · AI/ML


A production-grade AI/cybersecurity platform that combines a C++17 multi-threaded DPI engine with an ML pipeline for real-time traffic classification, unsupervised anomaly detection, SHAP explainability, and an LLM-powered network analyst — validated on 2.8M real network flows from the CIC-IDS-2017 benchmark.


| 🏆 98.95% Accuracy | 🔬 2.83M Flows | 🚨 0.625 ROC-AUC | ⚡ 4 Worker Threads | 🤖 LLM Analyst |
| --- | --- | --- | --- | --- |
| CIC-IDS-2017 RF | Real Benchmark | Isolation Forest | LB → FP Pipeline | Groq + Llama 3.1 |

📸 Screenshots

| Page | Shown in the screenshot |
| --- | --- |
| 📊 Overview | Total flows, packets, bytes, top apps, protocol split |
| 🎯 Model Performance | CIC-IDS-2017 benchmark: 98.95% accuracy, per-class F1 |
| 🚨 Anomaly Detector | Isolation Forest: 100% Heartbleed, 91.7% Infiltration detection |
| 🗂️ Traffic Map | Filterable live flow table with SNI, ports, bytes, duration |
| 🔴 Anomaly Detection | Score distribution, anomalies by app, 5% anomaly rate |
| 🤖 AI Analyst | Natural language queries via Groq + Llama 3.1 70B |

📋 Table of Contents

  1. What is DPI?
  2. Networking Background
  3. Project Overview
  4. System Architecture
  5. File Structure
  6. The Journey of a Packet — Simple Version
  7. The Journey of a Packet — Multi-threaded Version
  8. Deep Dive: Each Component
  9. How SNI Extraction Works
  10. How Blocking Works
  11. ML Pipeline
  12. Performance & Benchmark Results
  13. Dashboard & Demo
  14. Building and Running
  15. Understanding the Output
  16. Future Improvements

⚠️ Repository Notes

Dataset & Model Files

Large datasets, generated flow files, PCAP captures, and trained ML model artifacts are intentionally excluded from this repository to keep the project lightweight and GitHub-friendly.

Excluded assets include:

  • CIC-IDS-2017 raw datasets
  • Generated CSV flow exports
  • PCAP capture files
  • Large .pkl / .joblib trained models

This repository focuses on:

  • Source code
  • System architecture
  • ML pipeline implementation
  • Dashboard components
  • Documentation and reproducibility

Reproducing Results

To reproduce the benchmark results locally:

# Generate test traffic
python generate_test_pcap.py

# Run DPI engine
./dpi_engine test_dpi.pcap output.pcap

# Train classifier
python ml/train_classifier.py

# Run anomaly detection
python ml/anomaly_detector.py

# Launch dashboard
streamlit run dashboard/app.py

CIC-IDS-2017 Dataset

The project was evaluated using the public CIC-IDS-2017 benchmark dataset:

https://www.unb.ca/cic/datasets/ids-2017.html

After downloading the dataset:

python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv

Deployment Status

Current repository focus:

  • High-performance DPI engine
  • ML traffic analysis
  • Explainable AI
  • Dashboard visualization
  • AI traffic analyst

Planned future deployment support:

  • Docker containerization
  • docker-compose orchestration
  • Kubernetes scaling
  • Cloud-native deployment
  • Kafka streaming pipeline

1. What is DPI?

Deep Packet Inspection (DPI) is a technology used to examine the contents of network packets as they pass through a checkpoint. Unlike simple packet-filtering firewalls, which look only at packet headers (IP addresses, ports, protocol), DPI also inspects the packet payload.

Real-World Uses

| Industry | Use Case |
| --- | --- |
| ISPs | Throttle or block certain applications (e.g., BitTorrent) |
| Enterprises | Block social media on office networks |
| Parental Controls | Block inappropriate websites |
| Security | Detect malware or intrusion attempts |

What This Platform Does

User Traffic (PCAP) ──► [C++ DPI Engine] ──► Filtered Traffic (PCAP)
                               │
                               ▼
                    ┌─────────────────────┐
                    │  Flow CSV Export    │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
    [RF Classifier]   [Isolation Forest]   [SHAP Explainer]
    Traffic ID         Anomaly Detection    Why flagged?
              │                │                │
              └────────────────┼────────────────┘
                               ▼
                    [Streamlit Dashboard]
                    [LLM AI Analyst]

2. Networking Background

The Network Stack

When you visit a website, data travels through multiple layers:

┌─────────────────────────────────────────────────────────┐
│ Layer 7: Application    │ HTTP, TLS, DNS               │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Transport      │ TCP (reliable), UDP (fast)   │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Network        │ IP addresses (routing)       │
├─────────────────────────────────────────────────────────┤
│ Layer 2: Data Link      │ MAC addresses (local network)│
└─────────────────────────────────────────────────────────┘

A Packet's Structure

Every network packet is like a Russian nesting doll, with headers wrapped inside headers. The IP and TCP sizes shown are the minimums; either header grows when options are present:

┌──────────────────────────────────────────────────────────────────┐
│ Ethernet Header (14 bytes)                                       │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ IP Header (20 bytes)                                         │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ TCP Header (20 bytes)                                    │ │ │
│ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Payload (Application Data)                           │ │ │ │
│ │ │ │ e.g., TLS Client Hello with SNI                      │ │ │ │
│ │ │ └──────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

The Five-Tuple

A connection (or "flow") is uniquely identified by 5 values:

| Field | Example | Purpose |
| --- | --- | --- |
| Source IP | 192.168.1.100 | Who is sending |
| Destination IP | 172.217.14.206 | Where it's going |
| Source Port | 54321 | Sender's application identifier |
| Destination Port | 443 | Service being accessed (443 = HTTPS) |
| Protocol | TCP (6) | TCP or UDP |

All packets with the same 5-tuple belong to the same connection. This is how the engine tracks conversations and applies flow-level blocking.
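
As a rough illustration of flow tracking (not the engine's actual C++ types), a flow table keyed by the 5-tuple can be sketched in Python. The names `FiveTuple`, `FlowStats`, and `track` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FiveTuple:
    """Hashable flow key; a frozen dataclass can be used as a dict key."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # 6 = TCP, 17 = UDP


@dataclass
class FlowStats:
    packets: int = 0
    bytes: int = 0


def track(flows: Dict[FiveTuple, FlowStats], key: FiveTuple, size: int) -> None:
    """Add one packet of `size` bytes to the flow identified by `key`."""
    stats = flows.setdefault(key, FlowStats())
    stats.packets += 1
    stats.bytes += size


flows: Dict[FiveTuple, FlowStats] = {}
https = FiveTuple("192.168.1.100", "172.217.14.206", 54321, 443, 6)
dns = FiveTuple("192.168.1.100", "8.8.8.8", 40000, 53, 17)
for key, size in [(https, 60), (dns, 80), (https, 517)]:
    track(flows, key, size)

print(len(flows), flows[https])
```

Two packets with an identical 5-tuple update the same `FlowStats` entry, which is what lets a verdict reached on one packet apply to the rest of the flow.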

What is SNI?

Server Name Indication (SNI) is part of the TLS/HTTPS handshake. When you visit https://www.youtube.com:

  1. Your browser sends a "Client Hello" message
  2. This message includes the domain name in plaintext (not encrypted yet!)
  3. The server uses this to know which certificate to send
TLS Client Hello:
├── Version: TLS 1.2
├── Random: [32 bytes]
├── Cipher Suites: [list]
└── Extensions:
    └── SNI Extension:
        └── Server Name: "www.youtube.com"  ← Extracted here!

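This is the key to DPI: even though HTTPS is encrypted, the domain name is visible in the Client Hello, the first TLS message the client sends. The exception is a client that uses Encrypted Client Hello (ECH), which hides the real server name.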


3. Project Overview

Three C++ Engine Variants

| Version | File | Use Case |
| --- | --- | --- |
| Simple (single-threaded) | src/main_working.cpp | Learning, small captures |
| Multi-threaded (pipeline engine) | src/main_dpi.cpp | Main production-style DPI engine |
| Multi-threaded (self-contained) | src/dpi_mt.cpp | Compact standalone variant |

Full Platform Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Packet Engine | C++17, pthreads | Parse, classify, block traffic |
| Flow Export | CSV via FlowExporter | Bridge C++ → Python |
| ML Classifier | Random Forest (sklearn) | 13-class traffic identification |
| Anomaly Detector | Isolation Forest (sklearn) | Unsupervised zero-day detection |
| Explainability | SHAP | Why was a flow flagged? |
| Dashboard | Streamlit + Plotly | Live visual analytics |
| AI Analyst | Groq + Llama 3.1 70B | Natural language traffic queries |
| Benchmark | CIC-IDS-2017 | 2.83M real-world flow validation |

4. System Architecture

ASCII Architecture Diagram

 ┌─────────────────────────────────────────────────────────────────────┐
 │                    DPI NETWORK INTELLIGENCE PLATFORM                │
 └─────────────────────────────────────────────────────────────────────┘

  [PCAP Input File / Live Interface]
           │
           ▼
  ┌─────────────────┐
  │  PCAP Reader    │  Reads packet bytes + timestamps
  │  Thread         │
  └────────┬────────┘
           │  RawPacket structs
           ▼
  ┌─────────────────────────────────────┐
  │         LOAD BALANCER POOL          │
  │  ┌──────────┐     ┌──────────┐      │
  │  │   LB-0   │     │   LB-1   │      │  Hash(5-tuple) % N
  │  └────┬─────┘     └────┬─────┘      │  ensures same flow →
  └───────┼────────────────┼────────────┘  same FP thread
          │                │
     ┌────┘     ┌──────────┘
     ▼          ▼
  ┌──────────────────────────────────────────────────┐
  │              FAST PATH THREAD POOL               │
  │  ┌────────┐  ┌────────┐  ┌────────┐   ┌────────┐ │
  │  │  FP-0  │  │  FP-1  │  │  FP-2  │   │  FP-3  │ │
  │  │        │  │        │  │        │   │        │ │
  │  │ Parse  │  │ Parse  │  │ Parse  │   │ Parse  │ │
  │  │Classify│  │Classify│  │Classify│   │Classify│ │
  │  │ Block  │  │ Block  │  │ Block  │   │ Block  │ │
  │  └───┬────┘  └───┬────┘  └───┬────┘   └───┬────┘ │
  └──────┼───────────┼───────────┼────────────┼──────┘
         │           │           │            │
         └───────────┴─────┬─────┴────────────┘
                           │
              ┌────────────┴───────────┐
              │                        │
              ▼                        ▼
   ┌─────────────────┐      ┌─────────────────────┐
   │  Output Queue   │      │   Flow CSV Export   │
   │  (filtered      │      │   flows.csv         │
   │   packets)      │      │   (per flow stats)  │
   └────────┬────────┘      └──────────┬──────────┘
            │                          │
            ▼                          ▼
   ┌─────────────────┐      ┌──────────────────────────────────────────┐
   │  output.pcap    │      │           PYTHON ML PIPELINE             │
   └─────────────────┘      │                                          │
                            │  ┌──────────────┐  ┌──────────────────┐ │
                            │  │ Random Forest│  │ Isolation Forest │ │
                            │  │ Classifier   │  │ Anomaly Detector │ │
                            │  │ 98.95% acc   │  │ 0.625 ROC-AUC    │ │
                            │  └──────┬───────┘  └────────┬─────────┘ │
                            │         │                    │           │
                            │         └─────────┬──────────┘           │
                            │                   ▼                      │
                            │         ┌──────────────────┐             │
                            │         │ SHAP Explainer   │             │
                            │         │ (Why flagged?)   │             │
                            │         └──────────────────┘             │
                            └──────────────────┬───────────────────────┘
                                               │
                                               ▼
                            ┌──────────────────────────────────────────┐
                            │         STREAMLIT DASHBOARD              │
                            │  Overview │ Traffic Map │ Anomalies      │
                            │  App Breakdown │ 🤖 AI Analyst           │
                            │  (Groq + Llama 3.1 70B)                  │
                            └──────────────────────────────────────────┘

Mermaid Architecture Diagram

flowchart TD
    A[📁 PCAP File / Live Interface] --> B[PCAP Reader Thread]
    B --> C{Load Balancer Pool\nHash 5-tuple % N}
    C --> D[LB-0]
    C --> E[LB-1]
    D --> F[FP-0]
    D --> G[FP-1]
    E --> H[FP-2]
    E --> I[FP-3]
    F & G & H & I --> J[Output Queue]
    F & G & H & I --> K[Flow CSV Export\nflows.csv]
    J --> L[📄 output.pcap]
    K --> M[🌲 Random Forest\nClassifier\n98.95% acc]
    K --> N[🌳 Isolation Forest\nAnomaly Detector\n0.625 AUC]
    M --> O[SHAP Explainer]
    N --> O
    O --> P[📊 Streamlit Dashboard]
    P --> Q[🤖 AI Analyst\nGroq + Llama 3.1]

    style A fill:#1e3a5f,color:#fff
    style P fill:#1e3a5f,color:#fff
    style Q fill:#4a1a5f,color:#fff
    style M fill:#1a3a1f,color:#fff
    style N fill:#1a3a1f,color:#fff

5. File Structure

Packet_analyzer/
│
├── 📁 include/                     # C++ Header files
│   ├── pcap_reader.h               # PCAP file reading
│   ├── packet_parser.h             # Network protocol parsing
│   ├── sni_extractor.h             # TLS/HTTP deep inspection
│   ├── types.h                     # Core data structures
│   ├── flow_exporter.h             # CSV flow export (Step 2)
│   ├── rule_manager.h              # Blocking rules
│   ├── connection_tracker.h        # Stateful flow tracking
│   ├── load_balancer.h             # LB thread management
│   ├── fast_path.h                 # FP thread processing
│   └── dpi_engine.h                # Main orchestrator
│
├── 📁 src/                         # C++ Implementation
│   ├── pcap_reader.cpp
│   ├── packet_parser.cpp
│   ├── sni_extractor.cpp
│   ├── flow_exporter.cpp           # Thread-safe CSV export
│   ├── types.cpp
│   ├── main_working.cpp            # ★ Simple version
│   ├── main_dpi.cpp                # ★ Main multi-threaded pipeline
│   └── dpi_mt.cpp                  # ★ Self-contained multi-threaded variant
│
├── 📁 ml/                          # Python ML Pipeline
│   ├── train_classifier.py         # Random Forest training
│   ├── predict.py                  # Single-flow prediction
│   ├── anomaly_detector.py         # Isolation Forest
│   ├── explain_anomaly.py          # SHAP explainability
│   ├── preprocess_cicids.py        # CIC-IDS-2017 preprocessor
│   ├── evaluate_cicids.py          # Benchmark evaluation
│   ├── evaluate_anomaly_cicids.py  # Anomaly benchmark
│   └── models/                     # Saved .pkl models
│       ├── traffic_classifier.pkl
│       ├── anomaly_detector.pkl
│       └── confusion_matrix.png
│
├── 📁 dashboard/                   # Streamlit Dashboard
│   ├── app.py                      # 5-page dashboard
│   └── llm_analyst.py              # Groq AI analyst
│
├── 📁 tests/                       # Google Test unit tests
│   ├── test_packet_parser.cpp
│   └── test_sni_extractor.cpp
│
├── 📁 data/                        # Dataset (not tracked)
│   └── cicids2017/                 # CIC-IDS-2017 CSVs
│
├── requirements.txt                # Python dependencies
├── generate_test_pcap.py           # Synthetic PCAP generator
├── flows.csv                       # Generated flow data
├── flows_with_anomalies.csv        # Enriched with anomaly scores
├── test_dpi.pcap                   # Sample capture
├── CMakeLists.txt                  # CMake build config
└── README.md

6. The Journey of a Packet — Simple Version

Tracing a single packet through main_working.cpp:

Step 1: Read PCAP File

PcapReader reader;
reader.open("capture.pcap");

PCAP File Format:

┌────────────────────────────┐
│ Global Header (24 bytes)   │  ← Read once at start
├────────────────────────────┤
│ Packet Header (16 bytes)   │  ← Timestamp, length
│ Packet Data (variable)     │  ← Actual network bytes
├────────────────────────────┤
│ Packet Header (16 bytes)   │
│ Packet Data (variable)     │
└────────────────────────────┘

Step 2: Parse Protocol Headers

PacketParser::parse(raw, parsed);
raw.data bytes (offsets assume a 20-byte IPv4 header and a 20-byte TCP header, i.e. no options):
[0-13]   Ethernet Header  → parsed.src_mac, parsed.dest_mac, ether_type
[14-33]  IPv4 Header      → parsed.src_ip, parsed.dest_ip, parsed.protocol
[34-53]  TCP Header       → parsed.src_port, parsed.dest_port, parsed.tcp_flags
[54+]    Payload          → TLS ClientHello, HTTP headers, etc.

Step 3: Extract SNI

auto sni = SNIExtractor::extract(payload, payload_len);
// Returns: std::optional<std::string>
// Example: "www.youtube.com"

Step 4: Classify Application

// sni is a std::optional: value() throws if no SNI was found, so check first
if (sni) {
    AppType app = sniToAppType(*sni);
    // "www.youtube.com" → AppType::YOUTUBE
    // "www.facebook.com" → AppType::FACEBOOK
}

Step 5: Apply Rules + Forward/Drop

if (rules.isBlocked(src_ip, app, sni)) {
    stats.dropped++;
    continue;   // Don't write to output
}
output.write(packet);
stats.forwarded++;

7. The Journey of a Packet — Multi-threaded Version

Thread Architecture

Main Thread
    │
    ├── Spawn: LB Thread 0 ──► [TSQueue] ──► FP Thread 0
    │                     └──► [TSQueue] ──► FP Thread 1
    │
    ├── Spawn: LB Thread 1 ──► [TSQueue] ──► FP Thread 2
    │                     └──► [TSQueue] ──► FP Thread 3
    │
    ├── Spawn: Output Thread (writes filtered packets)
    │
    └── Reader Loop: reads packets → dispatches to LBs

Why Hash-Based Load Balancing?

size_t lb_idx = FiveTupleHash()(pkt.tuple) % num_lbs;
lbs_[lb_idx]->queue().push(pkt);

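The key insight: every packet of a connection carries the same 5-tuple (with source and destination swapped on the return path, so for both directions to land on the same thread the hash has to treat them alike). Hashing the 5-tuple therefore sends all packets of one connection to the same FP thread. This means:

  • ✅ No race conditions on flow state
  • ✅ No per-flow mutexes
  • ✅ Near-linear scaling with thread count, as long as flows hash evenly across threads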
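
The dispatch rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's `FiveTupleHash`: the helper names `flow_key` and `pick_worker` are hypothetical, CRC32 stands in for whatever hash the C++ code uses, and ordering the endpoints before hashing is one possible way to make both directions of a connection map to the same worker (the source does not say whether the engine does this).

```python
import zlib


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol) -> bytes:
    """Direction-independent key: order the two endpoints before hashing."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{a[0]}:{a[1]}|{b[0]}:{b[1]}|{protocol}".encode()


def pick_worker(key: bytes, num_workers: int) -> int:
    """Map a flow key to a worker index; crc32 is stable across runs."""
    return zlib.crc32(key) % num_workers


# The same connection seen from both directions.
client_to_server = flow_key("192.168.1.100", "172.217.14.206", 54321, 443, 6)
server_to_client = flow_key("172.217.14.206", "192.168.1.100", 443, 54321, 6)

worker_out = pick_worker(client_to_server, 4)
worker_back = pick_worker(server_to_client, 4)
print(worker_out == worker_back)
```

Because the worker index depends only on the flow key, per-flow state never has to be shared between workers.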

Thread-Safe Queue (Mutex + Condition Variables)

Producer (LB Thread)              Consumer (FP Thread)
        │                                 │
        ▼                                 ▼
   mutex.lock()                      mutex.lock()
   not_full.wait()                   not_empty.wait()
   (while queue full)                (while queue empty)
   queue.push(packet)                packet = queue.front()
   not_empty.notify()                queue.pop()
   mutex.unlock()                    not_full.notify()
                                     mutex.unlock()
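
A minimal Python sketch of a bounded blocking queue built from one mutex and two condition variables is shown below. It illustrates the general pattern only; the class name `BoundedQueue`, the `STOP` sentinel, and the method signatures are assumptions for this example and are not taken from the engine's `TSQueue`.

```python
import threading
from collections import deque


class BoundedQueue:
    """Blocking FIFO guarded by one lock and two condition variables."""

    def __init__(self, capacity: int):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self._not_empty = threading.Condition(self._lock)
        self._not_full = threading.Condition(self._lock)

    def push(self, item) -> None:
        with self._not_full:                      # acquires the shared lock
            while len(self._items) >= self._capacity:
                self._not_full.wait()             # releases the lock while waiting
            self._items.append(item)
            self._not_empty.notify()

    def pop(self, timeout=None):
        """Return the next item, or None if `timeout` seconds pass first."""
        with self._not_empty:
            if not self._not_empty.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            item = self._items.popleft()
            self._not_full.notify()
            return item


STOP = object()          # sentinel telling the consumer to exit
q = BoundedQueue(capacity=2)
received = []


def consumer():
    while True:
        item = q.pop()
        if item is STOP:
            break
        received.append(item)


worker = threading.Thread(target=consumer)
worker.start()
for n in range(5):
    q.push(n)            # blocks whenever two items are already queued
q.push(STOP)
worker.join()
print(received)
```

The waits happen while the lock is held and inside a loop (or `wait_for`), so a thread re-checks the queue state after every wake-up instead of assuming the condition still holds.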

Fast Path Processing Loop

Each FP thread runs this loop independently (pseudocode):

while running:
    pkt = input_queue.pop(timeout=100ms)
    if not pkt:
        continue                      // timeout: re-check `running`

    flow = flows_[pkt.tuple]          // O(1) average-case hash lookup

    if not flow.classified:
        try SNI extraction            // TLS, port 443
        try HTTP Host extraction      // HTTP, port 80
        try DNS detection             // port 53

    if not flow.blocked:
        flow.blocked = rules.check(src_ip, app, sni)

    if flow.blocked:
        stats.dropped++
    else:
        output_queue.push(pkt)
        stats.forwarded++

    if TCP FIN or RST seen:
        exportFlow(flow)              // write to flows.csv
        flows_.erase(pkt.tuple)       // cleanup

8. Deep Dive: Each Component

PcapReader

Reads binary PCAP files in two steps:

  1. Global Header (24 bytes): Magic number (validates file), version, timestamp precision, max packet size, link type
  2. Per-Packet: 16-byte header (timestamps + lengths) followed by raw packet data

struct PcapGlobalHeader {
    uint32_t magic_number;    // 0xa1b2c3d4 = valid PCAP (reads as 0xd4c3b2a1 if the file uses the opposite byte order)
    uint16_t version_major;   // usually 2
    uint16_t version_minor;   // usually 4
    int32_t  thiszone;        // timezone (usually 0)
    uint32_t sigfigs;         // timestamp accuracy (usually 0)
    uint32_t snaplen;         // max packet size (usually 65535)
    uint32_t network;         // link type (1 = Ethernet)
};
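
For readers who want to poke at the format without building the engine, the same two-step read can be sketched in Python with the standard `struct` module. This is an illustrative sketch, not a port of `PcapReader`: the function name `read_pcap` is hypothetical, and it only handles little-endian files with microsecond timestamps (magic `0xa1b2c3d4`).

```python
import struct

GLOBAL_HEADER = struct.Struct("<IHHiIII")   # 24 bytes, same field order as the C struct
RECORD_HEADER = struct.Struct("<IIII")      # ts_sec, ts_usec, incl_len, orig_len


def read_pcap(blob: bytes):
    """Return (link_type, [(timestamp_seconds, packet_bytes), ...])."""
    magic, _major, _minor, _zone, _sigfigs, _snaplen, link_type = \
        GLOBAL_HEADER.unpack_from(blob, 0)
    if magic != 0xA1B2C3D4:
        raise ValueError(f"unsupported magic number: {magic:#x}")

    packets = []
    offset = GLOBAL_HEADER.size
    while offset + RECORD_HEADER.size <= len(blob):
        ts_sec, ts_usec, incl_len, _orig_len = RECORD_HEADER.unpack_from(blob, offset)
        offset += RECORD_HEADER.size
        packets.append((ts_sec + ts_usec / 1e6, blob[offset:offset + incl_len]))
        offset += incl_len
    return link_type, packets


# Build a tiny in-memory capture holding two fake packets.
capture = GLOBAL_HEADER.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
for ts, payload in [(100, b"\x01\x02\x03"), (101, b"\xaa\xbb")]:
    capture += RECORD_HEADER.pack(ts, 500000, len(payload), len(payload)) + payload

link_type, packets = read_pcap(capture)
print(link_type, len(packets))
```

To try it on a real file, pass the result of `open("test_dpi.pcap", "rb").read()` to `read_pcap`.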

PacketParser

Parses raw bytes into structured data with manual bit manipulation:

// Extract IP version and header length from first byte
uint8_t ip_byte  = data[14];
uint8_t version  = (ip_byte >> 4) & 0x0F;   // top 4 bits
uint8_t ihl      = ip_byte & 0x0F;           // bottom 4 bits
size_t  ip_hdr_len = ihl * 4;                // IHL is in 32-bit words
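
The same bit manipulation can be tried out in Python. The sketch below is an assumption-laden illustration rather than the project's `PacketParser`: `parse_ipv4_tcp` is a hypothetical helper that handles only untagged Ethernet II frames carrying IPv4 over TCP. The test frame uses an IHL of 6 (a 24-byte IP header) to show why the TCP offset must be computed from IHL instead of being fixed at byte 34.

```python
import socket
import struct

ETH_LEN = 14  # Ethernet II header without VLAN tags


def parse_ipv4_tcp(frame: bytes):
    """Return a dict of header fields, or None if the frame is not IPv4/TCP."""
    if len(frame) < ETH_LEN + 20:
        return None
    if struct.unpack_from("!H", frame, 12)[0] != 0x0800:    # EtherType: IPv4
        return None
    first = frame[ETH_LEN]
    version, ihl = first >> 4, first & 0x0F                 # top / bottom 4 bits
    if version != 4 or ihl < 5:
        return None
    if frame[ETH_LEN + 9] != 6:                             # IP protocol 6 = TCP
        return None
    tcp_off = ETH_LEN + ihl * 4                             # IHL counts 32-bit words
    if len(frame) < tcp_off + 20:
        return None
    src_port, dst_port = struct.unpack_from("!HH", frame, tcp_off)
    data_off = (frame[tcp_off + 12] >> 4) * 4               # TCP header length
    return {
        "src_ip": socket.inet_ntoa(frame[ETH_LEN + 12:ETH_LEN + 16]),
        "dst_ip": socket.inet_ntoa(frame[ETH_LEN + 16:ETH_LEN + 20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "payload": frame[tcp_off + data_off:],
    }


# Hand-built frame: zeroed MACs, IPv4 header with 4 bytes of options (IHL = 6).
eth = bytes(12) + b"\x08\x00"
ip = (bytes([0x46, 0, 0, 0, 0, 0, 0, 0, 64, 6, 0, 0])
      + socket.inet_aton("192.168.1.100")
      + socket.inet_aton("172.217.14.206")
      + bytes(4))
tcp = struct.pack("!HHIIBBHHH", 54321, 443, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
parsed = parse_ipv4_tcp(eth + ip + tcp + b"hello")
print(parsed)
```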

SNIExtractor

Manually parses TLS binary format byte-by-byte:

TLS Record [byte 0]:     0x16 = Content Type: Handshake
TLS Record [bytes 1-2]:  0x03 0x01 = Version (TLS 1.0 compat)
TLS Record [bytes 3-4]:  Length of handshake data

Handshake [byte 5]:      0x01 = Client Hello
Handshake [bytes 6-8]:   3-byte length

Client Hello [bytes 9-10]:  Client version
Client Hello [bytes 11-42]: Random (32 bytes)
Client Hello [byte 43]:     Session ID length
... (cipher suites, compression) ...
Extensions length (2 bytes)
  Extension type 0x0000 = SNI
  Extension data length (2 bytes)
    SNI list length (2 bytes)
      SNI entry type 0x00 = hostname
      SNI name length (2 bytes)
      SNI name bytes ← THIS IS THE DOMAIN NAME
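
The byte layout above can be walked with a handful of offset calculations. The Python sketch below is illustrative only and is not the project's `SNIExtractor`: the names `extract_sni` and `build_client_hello` are hypothetical, the builder produces a deliberately minimal Client Hello for testing, and the parser assumes the whole Client Hello arrives in a single TLS record.

```python
import struct


def extract_sni(record: bytes):
    """Return the SNI hostname from a TLS Client Hello record, or None."""
    try:
        if record[0] != 0x16 or record[5] != 0x01:     # handshake / Client Hello
            return None
        pos = 9 + 2 + 32                               # skip client version + random
        pos += 1 + record[pos]                         # session ID
        pos += 2 + struct.unpack_from("!H", record, pos)[0]   # cipher suites
        pos += 1 + record[pos]                         # compression methods
        ext_end = pos + 2 + struct.unpack_from("!H", record, pos)[0]
        pos += 2
        while pos + 4 <= min(ext_end, len(record)):
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0x0000:                     # server_name extension
                # layout: list length (2), entry type (1), name length (2), name
                if record[pos + 2] != 0x00:            # 0x00 = hostname
                    return None
                name_len = struct.unpack_from("!H", record, pos + 3)[0]
                name = record[pos + 5:pos + 5 + name_len]
                return name.decode("ascii") if len(name) == name_len else None
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None                                    # truncated or malformed input
    return None


def build_client_hello(hostname: str) -> bytes:
    """Build a minimal Client Hello record carrying `hostname` (test helper)."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    sni_ext = struct.pack("!HHH", 0x0000, len(entry) + 2, len(entry)) + entry
    other_ext = struct.pack("!HH", 0x000A, 4) + b"\x00\x02\x00\x1d"
    exts = other_ext + sni_ext
    body = (b"\x03\x03" + bytes(32) + b"\x00"          # version, random, empty session ID
            + struct.pack("!H", 2) + b"\x13\x01"       # one cipher suite
            + b"\x01\x00"                              # one compression method (null)
            + struct.pack("!H", len(exts)) + exts)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


hello = build_client_hello("www.youtube.com")
print(extract_sni(hello))
```

Every length field is read from untrusted input, so the sketch treats any out-of-range read as "no SNI" instead of trusting the lengths.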

FlowExporter

Thread-safe CSV writer that captures per-flow statistics:

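void exportFlow(const Connection& conn) {
    // Compute derived metrics; guard against empty or zero-duration flows
    uint64_t duration_ms   = last_seen_ms - first_seen_ms;
    double   duration_s    = duration_ms / 1000.0;
    double   avg_pkt_size  = total_packets ? static_cast<double>(total_bytes) / total_packets : 0.0;
    double   pkts_per_sec  = duration_ms ? total_packets / duration_s : 0.0;
    double   bytes_per_sec = duration_ms ? total_bytes   / duration_s : 0.0;

    // Thread-safe write: the mutex serializes rows from different FP threads
    std::lock_guard<std::mutex> lock(mutex_);
    file_ << row_string << "\n";
}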

9. How SNI Extraction Works

Your browser                    YouTube Server
     │                               │
     │──── TLS Client Hello ────────►│
     │     ┌───────────────────────┐ │
     │     │ SNI: www.youtube.com  │ │  ← DPI reads this
     │     │ (PLAINTEXT!)          │ │
     │     └───────────────────────┘ │
     │◄─── TLS Server Hello ─────────│
     │     (certificate + params)    │
     │                               │
     │════ Encrypted from here ══════│

Why SNI is plaintext: The server needs to know which domain you're connecting to BEFORE encryption is established — to pick the right TLS certificate. This is by design in the TLS protocol.

What we extract:

| Traffic Type | Method | Field Extracted |
| --- | --- | --- |
| HTTPS (port 443) | TLS ClientHello parsing | SNI hostname |
| HTTP (port 80) | HTTP header parsing | Host: header |
| DNS (port 53) | UDP payload detection | Classified as DNS |
| Other | Port-based heuristic | App type guess |
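
For the plaintext HTTP case, pulling out the Host header is a short exercise. The sketch below is a hypothetical Python helper (`extract_http_host` is not a function from this repository) that assumes the request headers fit in the first payload.

```python
def extract_http_host(payload: bytes):
    """Return the Host header value from a plaintext HTTP request, or None."""
    head = payload.split(b"\r\n\r\n", 1)[0]       # headers only, drop any body
    lines = head.split(b"\r\n")
    for line in lines[1:]:                        # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip().decode("ascii", errors="replace")
    return None


request = b"GET /watch HTTP/1.1\r\nHost: www.example.com\r\nAccept: */*\r\n\r\n"
host = extract_http_host(request)
print(host)
```

Header names are case-insensitive in HTTP, which is why the comparison lowercases the name before matching.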

10. How Blocking Works

Decision Tree

Packet arrives
      │
      ▼
┌─────────────────────────────────┐
│ Is source IP in blocked list?  │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
┌─────────────────────────────────┐
│ Is app type in blocked list?   │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
┌─────────────────────────────────┐
│ Does SNI match blocked domain? │──Yes──► DROP
└───────────────┬─────────────────┘
                │No
                ▼
            FORWARD
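
The decision tree translates into three ordered checks. The following Python sketch is a hypothetical stand-in for the C++ rule manager: the `Rules` class and `is_blocked` method are names chosen for this example, and substring matching for domains is an assumption inferred from the `--block-domain facebook` example shown later in the README.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Rules:
    blocked_ips: Set[str] = field(default_factory=set)
    blocked_apps: Set[str] = field(default_factory=set)
    blocked_domains: Set[str] = field(default_factory=set)

    def is_blocked(self, src_ip: str, app: Optional[str], sni: Optional[str]) -> bool:
        """Apply the checks in the same order as the decision tree."""
        if src_ip in self.blocked_ips:
            return True
        if app is not None and app in self.blocked_apps:
            return True
        if sni is not None:
            # Assumption: a rule matches when it appears anywhere in the SNI.
            return any(domain in sni.lower() for domain in self.blocked_domains)
        return False


rules = Rules(blocked_ips={"192.168.1.50"},
              blocked_apps={"YouTube"},
              blocked_domains={"facebook"})
print(rules.is_blocked("10.0.0.1", "HTTPS", "www.facebook.com"))
```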

Flow-Based Blocking

Connection to YouTube:
  Packet 1 (SYN)           → No SNI yet, FORWARD
  Packet 2 (SYN-ACK)       → No SNI yet, FORWARD
  Packet 3 (ACK)           → No SNI yet, FORWARD
  Packet 4 (Client Hello)  → SNI: www.youtube.com
                           → App: YOUTUBE (blocked!)
                           → Mark flow as BLOCKED
                           → DROP this packet
  Packet 5 (Data)          → Flow is BLOCKED → DROP
  Packet 6 (Data)          → Flow is BLOCKED → DROP

We can't identify the app until we see the Client Hello. Once identified, all future packets of that flow are dropped automatically.
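
The "sticky" verdict can be reproduced in a few lines of Python. This is a toy model of the behaviour described above, with a hypothetical `filter_flow` helper and packets reduced to a label plus the app detected on that packet (if any).

```python
def filter_flow(packets, blocked_apps):
    """Return [(label, verdict), ...] for the packets of ONE flow, in order.

    Each packet is a (label, app_or_None) pair.
    """
    verdicts = []
    blocked = False
    for label, app in packets:
        if not blocked and app in blocked_apps:
            blocked = True          # sticky: stays set for the rest of the flow
        verdicts.append((label, "DROP" if blocked else "FORWARD"))
    return verdicts


flow = [("SYN", None), ("SYN-ACK", None), ("ACK", None),
        ("Client Hello", "YouTube"), ("Data", None), ("Data", None)]
verdicts = filter_flow(flow, {"YouTube"})
for label, verdict in verdicts:
    print(f"{label:<13} {verdict}")
```

The three handshake packets are forwarded because nothing identifies the application yet; everything from the Client Hello onward is dropped.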


11. ML Pipeline

Overview

flows.csv (from C++ engine)
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│                  FEATURE ENGINEERING                    │
│  log transforms · protocol encoding · port categories  │
│  bytes/pkt ratio · pps · bps · duration                │
└────────────────────────┬────────────────────────────────┘
                         │
          ┌──────────────┴──────────────┐
          ▼                             ▼
┌──────────────────┐          ┌──────────────────────┐
│  Random Forest   │          │  Isolation Forest    │
│  Classifier      │          │  Anomaly Detector    │
│                  │          │                      │
│  Supervised      │          │  Unsupervised        │
│  13 classes      │          │  No attack labels    │
│  98.95% accuracy │          │  0.625 ROC-AUC       │
└────────┬─────────┘          └──────────┬───────────┘
         │                               │
         └──────────────┬────────────────┘
                        ▼
               ┌──────────────────┐
               │  SHAP Explainer  │
               │  "Why flagged?"  │
               └──────────────────┘

Features Used

| Feature | Description | Why Important |
| --- | --- | --- |
| protocol_enc | TCP=0, UDP=1, ICMP=2 | Protocol type is discriminative |
| dst_port | Destination port number | Port 443/80/53 are strong signals |
| duration_ms | Flow duration in ms | Attack flows often short/long |
| total_packets | Packet count | DoS/DDoS have extreme counts |
| total_bytes | Byte count | Streaming vs scanning patterns |
| avg_packet_size | Bytes per packet | Small = scan, Large = stream |
| packets_per_second | Flow rate | BruteForce has high pps |
| bytes_per_second | Bandwidth | DDoS has extreme bps |
| log_* variants | Log-transformed versions | Compress skewed distributions |
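
A minimal sketch of how such a feature vector could be derived from one flow record is given below. It is not the project's training code: `build_features` is a hypothetical helper, and which columns receive a `log_` variant is an assumption (the table only says "log_* variants"). `math.log1p` is used so that zero-valued counts do not break the transform.

```python
import math

PROTOCOL_CODES = {"TCP": 0, "UDP": 1, "ICMP": 2}


def build_features(flow: dict) -> dict:
    """Turn one raw flow record into the numeric features listed above."""
    packets = flow["total_packets"]
    size = flow["total_bytes"]
    seconds = flow["duration_ms"] / 1000.0
    feats = {
        "protocol_enc": PROTOCOL_CODES[flow["protocol"]],
        "dst_port": flow["dst_port"],
        "duration_ms": flow["duration_ms"],
        "total_packets": packets,
        "total_bytes": size,
        "avg_packet_size": size / packets if packets else 0.0,
        "packets_per_second": packets / seconds if seconds > 0 else 0.0,
        "bytes_per_second": size / seconds if seconds > 0 else 0.0,
    }
    # Assumed set of log-transformed columns; log1p(x) = ln(1 + x).
    for name in ("duration_ms", "total_packets", "total_bytes"):
        feats[f"log_{name}"] = math.log1p(feats[name])
    return feats


features = build_features({"protocol": "TCP", "dst_port": 443,
                           "total_packets": 20, "total_bytes": 30000,
                           "duration_ms": 5000})
print(features)
```

The example flow (20 packets, 30,000 bytes, 5,000 ms) uses the same values as the `ml/predict.py` invocation shown in the next section.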

Running the ML Pipeline

# Step 1: Generate training data
./dpi_engine test_dpi.pcap output.pcap

# Step 2: Train the classifier
python ml/train_classifier.py

# Step 3: Run anomaly detection
python ml/anomaly_detector.py

# Step 4: Explain anomalies with SHAP
python ml/explain_anomaly.py --top 5

# Step 5: Classify a single flow
python ml/predict.py --dst-port 443 --packets 20 --bytes 30000 --duration 5000

# Step 6: CIC-IDS-2017 benchmark (needs dataset)
python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv

12. Performance & Benchmark Results

Validated on CIC-IDS-2017 — a public benchmark dataset containing 2,830,743 real network flows with labelled benign and attack traffic.

Traffic Classifier (Supervised — Random Forest)

| Metric | Score |
| --- | --- |
| Test Accuracy | 98.95% |
| Macro F1 Score | 84.73% |
| Weighted F1 | 99% |
| 5-Fold CV Macro-F1 | 81.07% ± 2.38% |
| Training Flows | 2,264,594 |
| Test Flows | 566,149 |
| Classes | 13 |

Per-Class Performance:

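| Class | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| DDoS | 100% | 100% | 100% | 25,606 |
| DNS | 100% | 100% | 100% | 191,563 |
| PortScan | 99% | 100% | 100% | 31,786 |
| HTTPS | 100% | 99% | 99% | 213,002 |
| DoS | 99% | 99% | 99% | 50,532 |
| SSH | 99% | 99% | 99% | 2,160 |
| SMTP | 98% | 99% | 98% | 756 |
| HTTP | 99% | 94% | 96% | 47,139 |
| BruteForce | 90% | 92% | 91% | 3,068 |
| Heartbleed | 100% | 100% | 100% | 2 |
| Infiltration | 100% | 71% | 83% | 7 |
| Botnet | 16% | 93% | 27%* | 393 |
| WebAttack | 5% | 78% | 9%* | 135 |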

* Low precision due to severe class imbalance. Botnet and WebAttack together account for under 0.1% of the test flows (393 and 135 of 566,149). This is a known limitation; SMOTE oversampling is listed under Future Improvements as the planned remedy.
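
To see why Macro-F1 sits so far below accuracy, it helps to recompute F1 from the precision and recall columns. The short Python check below uses only the rounded percentages from the table above, so the macro average it produces (about 84.7%) matches the reported 84.73% only to within rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


botnet_f1 = f1(0.16, 0.93)
webattack_f1 = f1(0.05, 0.78)

# Per-class F1 values in table order (rounded to two decimals).
per_class_f1 = [1.00, 1.00, 1.00, 0.99, 0.99, 0.99, 0.98,
                0.96, 0.91, 1.00, 0.83, 0.27, 0.09]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print(f"Botnet F1: {botnet_f1:.2f}, WebAttack F1: {webattack_f1:.2f}, macro: {macro_f1:.3f}")
```

Macro-F1 weights all 13 classes equally, so the two rare classes pull the average down even though they barely affect overall accuracy.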

Anomaly Detector (Unsupervised — Isolation Forest)

Trained with zero attack labels. Evaluated post-hoc against ground-truth CIC-IDS-2017 labels.

| Metric | Score |
| --- | --- |
| ROC-AUC | 0.6254 |
| Detection Rate (Recall) | 21.2% |
| False Alarm Rate | 19.7% |
| Precision | 20.8% |
| Attack Rate in Dataset | 19.7% |
| Contamination Parameter | 0.20 |
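
These figures are tied together by simple arithmetic, which makes a useful sanity check. The Python sketch below (the helper name `precision_from_rates` is hypothetical) derives precision and the overall flagged fraction from recall, false alarm rate, and attack rate. With the rounded inputs from the table it gives a precision of about 0.21, in line with the reported 20.8%, and a flagged fraction of about 0.20, in line with the contamination parameter.

```python
def precision_from_rates(recall: float, false_alarm_rate: float, attack_rate: float):
    """Return (precision, flagged_fraction) implied by the three rates."""
    true_pos = recall * attack_rate                       # attacks that were flagged
    false_pos = false_alarm_rate * (1.0 - attack_rate)    # benign flows that were flagged
    flagged = true_pos + false_pos
    return true_pos / flagged, flagged


precision, flagged = precision_from_rates(recall=0.212,
                                          false_alarm_rate=0.197,
                                          attack_rate=0.197)
print(f"precision ~ {precision:.3f}, flagged fraction ~ {flagged:.3f}")
```

Recall (21.2%) is only slightly higher than the false alarm rate (19.7%), which is why precision ends up close to the base attack rate.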

Per-Attack Detection Rate:

| Attack Type | Detection Rate | Notes |
| --- | --- | --- |
| Heartbleed | 100% | Highly anomalous packet structure |
| Infiltration | 91.7% | Rare + unusual behavior pattern |
| BruteForce | 46.6% | High packet rate detectable |
| DDoS | 46.1% | Volume anomaly detectable |
| DoS | 20.0% | Blends with high-traffic flows |
| PortScan | 0.7% | Designed to mimic normal TCP |
| Botnet | 3.0% | Specifically evades detection |
| WebAttack | 2.7% | Looks like normal HTTP traffic |

The Two-Layer Security Story:

Supervised (RF)   → Known attacks  → 98.95% accuracy  ← "We know what to look for"
Unsupervised (IF) → Unknown attacks → 0.625 ROC-AUC    ← "We can find new threats too"

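This mirrors real-world IDS architectures: a supervised layer for known attack signatures and an anomaly layer for zero-day threats. Note that an ROC-AUC of 0.625 is only modestly above the 0.5 of random guessing, so the anomaly layer here is a starting point rather than a dependable zero-day detector.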


13. Dashboard & Demo

Screenshots

| Page | Description |
| --- | --- |
| Overview | Total flows, packets, bytes, blocked %, top apps bar chart, protocol split |
| Model Performance | CIC-IDS-2017 metrics, per-class F1 chart, confusion table |
| Anomaly Detector | Detection rates by attack type, score distribution |
| Traffic Map | Filterable flow table, top source IPs |
| Anomaly Detection | Score histogram, scatter plot, top anomalous flows |
| App Breakdown | Traffic share pie, bandwidth pie, duration box plots |
| AI Analyst | Natural language queries powered by Groq + Llama 3.1 70B |

AI Analyst Sample Queries

"What was the most suspicious traffic in the last 10 minutes?"
"Which IPs were blocked and why?"
"Is there any sign of a port scan or DDoS?"
"Which app generated the most bandwidth?"
"Summarise all anomalies found."

Live Demo Mode

Use generated/test captures and rerun the pipeline to refresh dashboard inputs:

python generate_test_pcap.py
./dpi_engine test_dpi.pcap output.pcap
python ml/anomaly_detector.py
streamlit run dashboard/app.py

14. Building and Running

Prerequisites

  • Windows: MSYS2 with g++ | macOS/Linux: g++ or clang++
  • Python 3.11+ for the ML pipeline
  • No external C++ libraries required

All commands run from the project root (Packet_analyzer/).

Build C++ Engine

cmake -S . -B build
cmake --build build --config Release

Generated binaries include packet_analyzer, dpi_working, dpi_engine, and dpi_mt.

Run the Engine

# Basic (main multi-threaded pipeline)
./dpi_engine test_dpi.pcap output.pcap

# Block a single app
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube

# Block multiple apps + IP + domain
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube --block-app TikTok --block-ip 192.168.1.50 --block-domain facebook

# Configure thread count
./dpi_engine input.pcap output.pcap --lbs 4 --fps 4

Python ML Pipeline

pip install -r requirements.txt

# Train classifier
python ml/train_classifier.py

# Run anomaly detection
python ml/anomaly_detector.py

# Launch dashboard
streamlit run dashboard/app.py

Environment Setup

Create a .env file in the project root (never commit this):

GROQ_API_KEY=gsk_your_key_here

Get a free Groq API key at console.groq.com.

Generate Test Data

python generate_test_pcap.py

GitHub Repository Notes

To comply with GitHub storage limits and keep cloning and setup fast, the repository intentionally excludes:

  • large datasets
  • generated artifacts
  • trained model binaries
  • PCAP captures

Recommended workflow:

git clone <repo>
pip install -r requirements.txt
cmake -S . -B build
cmake --build build --config Release

Then run the pipeline locally using your own datasets or generated traffic.

15. Understanding the Output

Engine Terminal Output

╔══════════════════════════════════════════════════════════════╗
║              DPI ENGINE v2.0 (Multi-threaded)                ║
╠══════════════════════════════════════════════════════════════╣
║ Load Balancers:  2    FPs per LB:  2    Total FPs:  4        ║
╚══════════════════════════════════════════════════════════════╝

[Rules] Blocked app: YouTube
[Rules] Blocked IP: 192.168.1.50
[Reader] Processing packets...
[Reader] Done reading 77 packets

╔══════════════════════════════════════════════════════════════╗
║                      PROCESSING REPORT                       ║
╠══════════════════════════════════════════════════════════════╣
║ Total Packets:                77                             ║
║ Total Bytes:                5738                             ║
║ TCP Packets:                  73                             ║
║ UDP Packets:                   4                             ║
╠══════════════════════════════════════════════════════════════╣
║ Forwarded:                    69                             ║
║ Dropped:                       8                             ║
╠══════════════════════════════════════════════════════════════╣
║ THREAD STATISTICS                                            ║
║   LB0 dispatched:             53                             ║
║   FP0 processed:              53                             ║
║   FP3 processed:              24                             ║
╠══════════════════════════════════════════════════════════════╣
║                   APPLICATION BREAKDOWN                      ║
╠══════════════════════════════════════════════════════════════╣
║ HTTPS        39  50.6% ##########                            ║
║ YouTube       4   5.2% # (BLOCKED)                           ║
║ DNS           4   5.2% #                                     ║
╚══════════════════════════════════════════════════════════════╝

[Detected Domains/SNIs]
  - www.youtube.com -> YouTube (BLOCKED)
  - www.facebook.com -> Facebook
  - www.google.com -> Google

Output Reference

| Section | Meaning |
| --- | --- |
| Configuration | Thread pool size |
| Rules | Active blocking rules |
| Total Packets | Read from input PCAP |
| Forwarded | Written to output PCAP |
| Dropped | Blocked by rules |
| Thread Statistics | Work distribution across LB/FP threads |
| Application Breakdown | Traffic classification by app |
| Detected SNIs | Domain names extracted from TLS handshakes |

16. Future Improvements

Near-Term (High ROI)

| Item | Description |
| --- | --- |
| SMOTE Oversampling | Address the Botnet/WebAttack class imbalance (target: Macro-F1 of 92%+) |
| Live libpcap | Capture from real network interface instead of PCAP files |
| Kubernetes Scaling | Horizontal scaling with k8s deployments for high-throughput environments |
| Grafana Integration | Replace Streamlit with production Grafana + InfluxDB dashboards |

Medium-Term

| Item | Description |
| --- | --- |
| Kafka Streaming | Replace file-based flow export with real-time Kafka topic pipeline |
| SIEM Integration | Export alerts to Splunk / Elastic SIEM via CEF/syslog format |
| QUIC/HTTP3 Support | Detect traffic on UDP port 443 with QUIC Initial packet parsing |
| Distributed Processing | Multi-node packet processing with shared flow state via Redis |

Long-Term (Research)

| Item | Description |
| --- | --- |
| Transformer Traffic Analysis | Replace Random Forest with FlowTransformer / ET-BERT for sequence modeling |
| Online Learning | Incremental model updates without full retraining (River ML) |
| GPU Acceleration | CUDA-based packet parsing for 10Gbps+ line rate processing |
| Federated Learning | Train across multiple network nodes without centralising raw traffic |
| Cloud Deployment | AWS/GCP managed deployment with auto-scaling and CloudWatch metrics |

Summary

This platform demonstrates a complete AI/cybersecurity engineering stack:

| Skill | Implementation |
| --- | --- |
| Systems Programming | C++17 multi-threaded packet engine, manual binary parsing |
| Network Protocols | Ethernet/IP/TCP/UDP/TLS parsing, SNI extraction |
| Concurrent Programming | Thread pools, mutex-guarded thread-safe queues, atomic counters |
| ML Engineering | Feature engineering, Random Forest, Isolation Forest, cross-validation |
| AI Explainability | SHAP values for anomaly explanation |
| LLM Integration | Groq API + Llama 3.1 70B for natural language analytics |
| Data Engineering | C++ → CSV → Python bridge, real-world benchmark evaluation |
| Full-Stack | Streamlit dashboard with 5 interactive pages |
| DevOps | CMake-based build workflow |
| Research Validation | CIC-IDS-2017 benchmark, 2.83M flows, published metrics |

The key insight: even HTTPS traffic leaks the destination domain in the TLS handshake, enabling identification and control of application usage — and that signal, combined with flow-level statistics, is powerful enough to train production-grade ML models.


Built with C++17 · Python 3.11 · scikit-learn · Streamlit · Groq

If this project helped you, consider giving it a ⭐

