A production-grade AI/cybersecurity platform that combines a C++17 multi-threaded DPI engine with an ML pipeline for real-time traffic classification, unsupervised anomaly detection, SHAP explainability, and an LLM-powered network analyst — validated on 2.8M real network flows from the CIC-IDS-2017 benchmark.
| 🏆 98.95% Accuracy | 🔬 2.83M Flows | 🚨 0.625 ROC-AUC | ⚡ 4 Worker Threads | 🤖 LLM Analyst |
|---|---|---|---|---|
| CIC-IDS-2017 RF | Real Benchmark | Isolation Forest | LB → FP Pipeline | Groq + Llama 3.1 |
- What is DPI?
- Networking Background
- Project Overview
- System Architecture
- File Structure
- The Journey of a Packet — Simple Version
- The Journey of a Packet — Multi-threaded Version
- Deep Dive: Each Component
- How SNI Extraction Works
- How Blocking Works
- ML Pipeline
- Performance & Benchmark Results
- Dashboard & Demo
- Building and Running
- Understanding the Output
- Future Improvements
Large datasets, generated flow files, PCAP captures, and trained ML model artifacts are intentionally excluded from this repository to keep the project lightweight and GitHub-friendly.
Excluded assets include:
- CIC-IDS-2017 raw datasets
- Generated CSV flow exports
- PCAP capture files
- Large .pkl/.joblib trained models
This repository focuses on:
- Source code
- System architecture
- ML pipeline implementation
- Dashboard components
- Documentation and reproducibility
To reproduce the benchmark results locally:
# Generate test traffic
python generate_test_pcap.py
# Run DPI engine
./dpi_engine test_dpi.pcap output.pcap
# Train classifier
python ml/train_classifier.py
# Run anomaly detection
python ml/anomaly_detector.py
# Launch dashboard
streamlit run dashboard/app.py
The project was evaluated using the public CIC-IDS-2017 benchmark dataset:
https://www.unb.ca/cic/datasets/ids-2017.html
After downloading the dataset:
python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv
Current repository focus:
- High-performance DPI engine
- ML traffic analysis
- Explainable AI
- Dashboard visualization
- AI traffic analyst
Planned future deployment support:
- Docker containerization
- docker-compose orchestration
- Kubernetes scaling
- Cloud-native deployment
- Kafka streaming pipeline
Deep Packet Inspection (DPI) is a technology used to examine the contents of network packets as they pass through a checkpoint. Unlike simple firewalls that only look at packet headers (source/destination IP), DPI looks inside the packet payload.
| Industry | Use Case |
|---|---|
| ISPs | Throttle or block certain applications (e.g., BitTorrent) |
| Enterprises | Block social media on office networks |
| Parental Controls | Block inappropriate websites |
| Security | Detect malware or intrusion attempts |
User Traffic (PCAP) ──► [C++ DPI Engine] ──► Filtered Traffic (PCAP)
│
▼
┌─────────────────────┐
│ Flow CSV Export │
└──────────┬──────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
[RF Classifier] [Isolation Forest] [SHAP Explainer]
Traffic ID Anomaly Detection Why flagged?
│ │ │
└────────────────┼────────────────┘
▼
[Streamlit Dashboard]
[LLM AI Analyst]
When you visit a website, data travels through multiple layers:
┌─────────────────────────────────────────────────────────┐
│ Layer 7: Application │ HTTP, TLS, DNS │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Transport │ TCP (reliable), UDP (fast) │
├─────────────────────────────────────────────────────────┤
│ Layer 3: Network │ IP addresses (routing) │
├─────────────────────────────────────────────────────────┤
│ Layer 2: Data Link │ MAC addresses (local network)│
└─────────────────────────────────────────────────────────┘
Every network packet is like a Russian nesting doll — headers wrapped inside headers:
┌──────────────────────────────────────────────────────────────────┐
│ Ethernet Header (14 bytes) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ IP Header (20 bytes) │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ TCP Header (20 bytes) │ │ │
│ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Payload (Application Data) │ │ │ │
│ │ │ │ e.g., TLS Client Hello with SNI │ │ │ │
│ │ │ └──────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
A connection (or "flow") is uniquely identified by 5 values:
| Field | Example | Purpose |
|---|---|---|
| Source IP | 192.168.1.100 | Who is sending |
| Destination IP | 172.217.14.206 | Where it's going |
| Source Port | 54321 | Sender's application identifier |
| Destination Port | 443 | Service being accessed (443 = HTTPS) |
| Protocol | TCP (6) | TCP or UDP |
All packets with the same 5-tuple belong to the same connection. This is how the engine tracks conversations and applies flow-level blocking.
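As a rough illustration of 5-tuple flow tracking, the sketch below groups packets into flows keyed by the five values in the table. The names `FiveTuple` and `group_by_flow` are hypothetical and are not the engine's C++ types. The key is direction-sensitive, so reply packets (source and destination swapped) would land in a separate entry unless the key is normalised first.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Flow key. Field names are illustrative, not the engine's own."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # 6 = TCP, 17 = UDP


def group_by_flow(packets):
    """Aggregate (five_tuple, length) pairs into per-flow packet and byte counts."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, length in packets:
        flows[key]["packets"] += 1
        flows[key]["bytes"] += length
    return dict(flows)


https_flow = FiveTuple("192.168.1.100", "172.217.14.206", 54321, 443, 6)
dns_flow = FiveTuple("192.168.1.100", "8.8.8.8", 40000, 53, 17)
packets = [(https_flow, 74), (https_flow, 517), (dns_flow, 60)]

flows = group_by_flow(packets)
print(len(flows), flows[https_flow])
```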
Server Name Indication (SNI) is part of the TLS/HTTPS handshake. When you visit https://www.youtube.com:
- Your browser sends a "Client Hello" message
- This message includes the domain name in plaintext (not encrypted yet!)
- The server uses this to know which certificate to send
TLS Client Hello:
├── Version: TLS 1.2
├── Random: [32 bytes]
├── Cipher Suites: [list]
└── Extensions:
└── SNI Extension:
└── Server Name: "www.youtube.com" ← Extracted here!
This is the key to DPI: even though HTTPS payloads are encrypted, the domain name is visible in plaintext in the Client Hello, unless the client uses Encrypted Client Hello (ECH).
| Version | File | Use Case |
|---|---|---|
| Simple (Single-threaded) | src/main_working.cpp | Learning, small captures |
| Multi-threaded (pipeline engine) | src/main_dpi.cpp | Main production-style DPI engine |
| Multi-threaded (self-contained) | src/dpi_mt.cpp | Compact standalone variant |
| Layer | Technology | Purpose |
|---|---|---|
| Packet Engine | C++17, pthreads | Parse, classify, block traffic |
| Flow Export | CSV via FlowExporter | Bridge C++ → Python |
| ML Classifier | Random Forest (sklearn) | 13-class traffic identification |
| Anomaly Detector | Isolation Forest (sklearn) | Unsupervised zero-day detection |
| Explainability | SHAP | Why was a flow flagged? |
| Dashboard | Streamlit + Plotly | Live visual analytics |
| AI Analyst | Groq + Llama 3.1 70B | Natural language traffic queries |
| Benchmark | CIC-IDS-2017 | 2.83M real-world flow validation |
┌─────────────────────────────────────────────────────────────────────┐
│ DPI NETWORK INTELLIGENCE PLATFORM │
└─────────────────────────────────────────────────────────────────────┘
[PCAP Input File / Live Interface]
│
▼
┌─────────────────┐
│ PCAP Reader │ Reads packet bytes + timestamps
│ Thread │
└────────┬────────┘
│ RawPacket structs
▼
┌─────────────────────────────────────┐
│ LOAD BALANCER POOL │
│ ┌──────────┐ ┌──────────┐ │
│ │ LB-0 │ │ LB-1 │ │ Hash(5-tuple) % N
│ └────┬─────┘ └────┬─────┘ │ ensures same flow →
└───────┼────────────────┼────────────┘ same FP thread
│ │
┌────┘ ┌──────────┘
▼ ▼
┌──────────────────────────────────────────────────┐
│ FAST PATH THREAD POOL │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ FP-0 │ │ FP-1 │ │ FP-2 │ │ FP-3 │ │
│ │ │ │ │ │ │ │ │ │
│ │ Parse │ │ Parse │ │ Parse │ │ Parse │ │
│ │Classify│ │Classify│ │Classify│ │Classify│ │
│ │ Block │ │ Block │ │ Block │ │ Block │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
└──────┼───────────┼───────────┼────────────┼──────┘
│ │ │ │
└───────────┴─────┬─────┴────────────┘
│
┌────────────┴───────────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────┐
│ Output Queue │ │ Flow CSV Export │
│ (filtered │ │ flows.csv │
│ packets) │ │ (per flow stats) │
└────────┬────────┘ └──────────┬──────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────────────────────────┐
│ output.pcap │ │ PYTHON ML PIPELINE │
└─────────────────┘ │ │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ Random Forest│ │ Isolation Forest │ │
│ │ Classifier │ │ Anomaly Detector │ │
│ │ 98.95% acc │ │ 0.625 ROC-AUC │ │
│ └──────┬───────┘ └────────┬─────────┘ │
│ │ │ │
│ └─────────┬──────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ SHAP Explainer │ │
│ │ (Why flagged?) │ │
│ └──────────────────┘ │
└──────────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ STREAMLIT DASHBOARD │
│ Overview │ Traffic Map │ Anomalies │
│ App Breakdown │ 🤖 AI Analyst │
│ (Groq + Llama 3.1 70B) │
└──────────────────────────────────────────┘
flowchart TD
A[📁 PCAP File / Live Interface] --> B[PCAP Reader Thread]
B --> C{Load Balancer Pool\nHash 5-tuple % N}
C --> D[LB-0]
C --> E[LB-1]
D --> F[FP-0]
D --> G[FP-1]
E --> H[FP-2]
E --> I[FP-3]
F & G & H & I --> J[Output Queue]
F & G & H & I --> K[Flow CSV Export\nflows.csv]
J --> L[📄 output.pcap]
K --> M[🌲 Random Forest\nClassifier\n98.95% acc]
K --> N[🌳 Isolation Forest\nAnomaly Detector\n0.625 AUC]
M --> O[SHAP Explainer]
N --> O
O --> P[📊 Streamlit Dashboard]
P --> Q[🤖 AI Analyst\nGroq + Llama 3.1]
style A fill:#1e3a5f,color:#fff
style P fill:#1e3a5f,color:#fff
style Q fill:#4a1a5f,color:#fff
style M fill:#1a3a1f,color:#fff
style N fill:#1a3a1f,color:#fff
Packet_analyzer/
│
├── 📁 include/ # C++ Header files
│ ├── pcap_reader.h # PCAP file reading
│ ├── packet_parser.h # Network protocol parsing
│ ├── sni_extractor.h # TLS/HTTP deep inspection
│ ├── types.h # Core data structures
│ ├── flow_exporter.h # CSV flow export (Step 2)
│ ├── rule_manager.h # Blocking rules
│ ├── connection_tracker.h # Stateful flow tracking
│ ├── load_balancer.h # LB thread management
│ ├── fast_path.h # FP thread processing
│ └── dpi_engine.h # Main orchestrator
│
├── 📁 src/ # C++ Implementation
│ ├── pcap_reader.cpp
│ ├── packet_parser.cpp
│ ├── sni_extractor.cpp
│ ├── flow_exporter.cpp # Thread-safe CSV export
│ ├── types.cpp
│ ├── main_working.cpp # ★ Simple version
│ ├── main_dpi.cpp # ★ Main multi-threaded pipeline
│ └── dpi_mt.cpp # ★ Self-contained multi-threaded variant
│
├── 📁 ml/ # Python ML Pipeline
│ ├── train_classifier.py # Random Forest training
│ ├── predict.py # Single-flow prediction
│ ├── anomaly_detector.py # Isolation Forest
│ ├── explain_anomaly.py # SHAP explainability
│ ├── preprocess_cicids.py # CIC-IDS-2017 preprocessor
│ ├── evaluate_cicids.py # Benchmark evaluation
│ ├── evaluate_anomaly_cicids.py # Anomaly benchmark
│ └── models/ # Saved .pkl models
│ ├── traffic_classifier.pkl
│ ├── anomaly_detector.pkl
│ └── confusion_matrix.png
│
├── 📁 dashboard/ # Streamlit Dashboard
│ ├── app.py # 5-page dashboard
│ └── llm_analyst.py # Groq AI analyst
│
├── 📁 tests/ # Google Test unit tests
│ ├── test_packet_parser.cpp
│ └── test_sni_extractor.cpp
│
├── 📁 data/ # Dataset (not tracked)
│ └── cicids2017/ # CIC-IDS-2017 CSVs
│
├── requirements.txt # Python dependencies
├── generate_test_pcap.py # Synthetic PCAP generator
├── flows.csv # Generated flow data
├── flows_with_anomalies.csv # Enriched with anomaly scores
├── test_dpi.pcap # Sample capture
├── CMakeLists.txt # CMake build config
└── README.md
Tracing a single packet through main_working.cpp:
PcapReader reader;
reader.open("capture.pcap");
PCAP File Format:
┌────────────────────────────┐
│ Global Header (24 bytes) │ ← Read once at start
├────────────────────────────┤
│ Packet Header (16 bytes) │ ← Timestamp, length
│ Packet Data (variable) │ ← Actual network bytes
├────────────────────────────┤
│ Packet Header (16 bytes) │
│ Packet Data (variable) │
└────────────────────────────┘
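A minimal Python sketch of reading the two header types shown above, assuming the classic (non-pcapng) format with microsecond timestamps. The function names are illustrative; the engine's own reader is the C++ PcapReader.

```python
import struct

GLOBAL_HEADER_LEN = 24
PACKET_HEADER_LEN = 16


def parse_global_header(buf: bytes) -> dict:
    """Parse the 24-byte global header and detect the file's byte order."""
    if len(buf) < GLOBAL_HEADER_LEN:
        raise ValueError("truncated global header")
    (magic,) = struct.unpack("<I", buf[:4])
    if magic == 0xA1B2C3D4:
        endian = "<"  # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"  # file was written big-endian
    else:
        raise ValueError("not a classic PCAP file")
    _, major, minor, _zone, _sigfigs, snaplen, network = struct.unpack(
        endian + "IHHiIII", buf[:GLOBAL_HEADER_LEN]
    )
    return {
        "endian": endian,
        "version": (major, minor),
        "snaplen": snaplen,
        "link_type": network,  # 1 = Ethernet
    }


def parse_packet_header(buf: bytes, endian: str = "<") -> dict:
    """Parse one 16-byte per-packet header (timestamp and lengths)."""
    if len(buf) < PACKET_HEADER_LEN:
        raise ValueError("truncated packet header")
    ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
        endian + "IIII", buf[:PACKET_HEADER_LEN]
    )
    return {
        "timestamp": ts_sec + ts_usec / 1_000_000,
        "captured_len": incl_len,   # bytes actually stored in the file
        "original_len": orig_len,   # bytes on the wire
    }


example = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
header = parse_global_header(example)
print(header)
```

Checking the magic number in both byte orders lets the same reader handle captures written on little-endian and big-endian machines.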
PacketParser::parse(raw, parsed);
raw.data bytes:
[0-13] Ethernet Header → parsed.src_mac, parsed.dest_mac, ether_type
[14-33] IPv4 Header → parsed.src_ip, parsed.dest_ip, parsed.protocol
[34-53] TCP Header → parsed.src_port, parsed.dest_port, parsed.tcp_flags
[54+] Payload → TLS ClientHello, HTTP headers, etc.
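The fixed offsets above hold only when the IPv4 and TCP headers carry no options. The following Python sketch shows one way to compute the payload offset from the header-length fields instead. It is an illustration of the layout, not a port of PacketParser, and `parse_ipv4_tcp` is a hypothetical name.

```python
import socket
import struct

ETH_LEN = 14


def parse_ipv4_tcp(frame: bytes):
    """Return a dict for an Ethernet/IPv4/TCP frame, or None for anything else."""
    if len(frame) < ETH_LEN + 20:
        return None
    (ether_type,) = struct.unpack("!H", frame[12:14])
    if ether_type != 0x0800:                 # not IPv4
        return None
    version_ihl = frame[ETH_LEN]
    ip_hdr_len = (version_ihl & 0x0F) * 4    # IHL counts 32-bit words
    if version_ihl >> 4 != 4 or ip_hdr_len < 20:
        return None
    if frame[ETH_LEN + 9] != 6:              # IP protocol field: 6 = TCP
        return None
    tcp_off = ETH_LEN + ip_hdr_len
    if len(frame) < tcp_off + 20:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[tcp_off:tcp_off + 4])
    tcp_hdr_len = (frame[tcp_off + 12] >> 4) * 4   # data offset, also in words
    if tcp_hdr_len < 20:
        return None
    return {
        "src_ip": socket.inet_ntoa(frame[ETH_LEN + 12:ETH_LEN + 16]),
        "dst_ip": socket.inet_ntoa(frame[ETH_LEN + 16:ETH_LEN + 20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "payload": frame[tcp_off + tcp_hdr_len:],
    }


# Hand-built frame: 14-byte Ethernet + 20-byte IPv4 + 20-byte TCP + payload
eth = b"\x00" * 6 + b"\x11" * 6 + b"\x08\x00"
ip = (bytes([0x45, 0x00]) + struct.pack("!H", 45) + b"\x00" * 4
      + bytes([64, 6]) + b"\x00\x00"
      + socket.inet_aton("192.168.1.100") + socket.inet_aton("172.217.14.206"))
tcp = struct.pack("!HH", 54321, 443) + b"\x00" * 8 + bytes([0x50, 0x18]) + b"\x00" * 6
parsed = parse_ipv4_tcp(eth + ip + tcp + b"hello")
print(parsed)
```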
auto sni = SNIExtractor::extract(payload, payload_len);
// Returns: std::optional<std::string>
// Example: "www.youtube.com"
AppType app = sniToAppType(sni.value());
// "www.youtube.com" → AppType::YOUTUBE
// "www.facebook.com" → AppType::FACEBOOK
if (rules.isBlocked(src_ip, app, sni)) {
    stats.dropped++;
    continue; // Don't write to output
}
output.write(packet);
stats.forwarded++;
Main Thread
│
├── Spawn: LB Thread 0 ──► [TSQueue] ──► FP Thread 0
│ └──► [TSQueue] ──► FP Thread 1
│
├── Spawn: LB Thread 1 ──► [TSQueue] ──► FP Thread 2
│ └──► [TSQueue] ──► FP Thread 3
│
├── Spawn: Output Thread (writes filtered packets)
│
└── Reader Loop: reads packets → dispatches to LBs
size_t lb_idx = FiveTupleHash()(pkt.tuple) % num_lbs;
lbs_[lb_idx]->queue().push(pkt);
The key insight: all packets of the same TCP connection share the same 5-tuple. By hashing the 5-tuple, we guarantee that all packets of one connection always go to the same FP thread. This means:
- ✅ No race conditions on flow state
- ✅ No mutexes needed per flow
- ✅ Near-linear scaling with thread count, since FP threads share no flow state
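The dispatch rule can be sketched in Python as follows. CRC32 stands in for the engine's FiveTupleHash, whose actual implementation is not shown here, and the `bidirectional` option is an assumption added for illustration: it sorts the two endpoints so that both directions of a connection map to the same worker.

```python
import socket
import struct
import zlib


def flow_hash(five_tuple) -> int:
    """Deterministic hash of (src_ip, dst_ip, src_port, dst_port, protocol)."""
    src_ip, dst_ip, src_port, dst_port, protocol = five_tuple
    key = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
           + struct.pack("!HHB", src_port, dst_port, protocol))
    return zlib.crc32(key)


def pick_worker(five_tuple, num_workers: int, bidirectional: bool = False) -> int:
    """Map a flow to a worker index in [0, num_workers)."""
    if bidirectional:
        src_ip, dst_ip, src_port, dst_port, protocol = five_tuple
        # Order the endpoints so A->B and B->A produce the same key
        (ip_a, port_a), (ip_b, port_b) = sorted(
            [(src_ip, src_port), (dst_ip, dst_port)]
        )
        five_tuple = (ip_a, ip_b, port_a, port_b, protocol)
    return flow_hash(five_tuple) % num_workers


forward = ("192.168.1.100", "172.217.14.206", 54321, 443, 6)
reverse = ("172.217.14.206", "192.168.1.100", 443, 54321, 6)

print(pick_worker(forward, 4) == pick_worker(forward, 4))              # same flow, same worker
print(pick_worker(forward, 4, True) == pick_worker(reverse, 4, True))  # both directions agree
```

Because the worker index depends only on the flow key, per-flow state can live in a plain per-thread hash map without locking.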
Producer (LB Thread)              Consumer (FP Thread)
       │                                 │
       ▼                                 ▼
  mutex.lock()                     mutex.lock()
  not_full.wait()                  not_empty.wait()
  (while queue full)               (while queue empty)
  queue.push(packet)               packet = queue.front()
  not_empty.notify()               queue.pop()
                                   not_full.notify()
  mutex.unlock()                   mutex.unlock()
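A minimal Python sketch of a bounded producer/consumer queue of this kind, assuming one mutex shared by two condition variables. `BoundedQueue` is a hypothetical stand-in for the C++ TSQueue. Each side waits while holding the lock and before touching the queue, and each side wakes the other after changing it.

```python
import threading
from collections import deque


class BoundedQueue:
    def __init__(self, capacity: int):
        self._items = deque()
        self._capacity = capacity
        lock = threading.Lock()
        self._not_empty = threading.Condition(lock)
        self._not_full = threading.Condition(lock)

    def push(self, item) -> None:
        with self._not_full:
            while len(self._items) >= self._capacity:
                self._not_full.wait()          # releases the lock while waiting
            self._items.append(item)
            self._not_empty.notify()           # wake one waiting consumer

    def pop(self, timeout=None):
        """Return the next item, or None if nothing arrives within timeout."""
        with self._not_empty:
            if not self._not_empty.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            item = self._items.popleft()
            self._not_full.notify()            # wake one waiting producer
            return item


STOP = object()  # sentinel telling the consumer to exit
queue = BoundedQueue(capacity=4)
received = []


def consumer():
    while True:
        item = queue.pop(timeout=1.0)
        if item is None or item is STOP:
            break
        received.append(item)


worker = threading.Thread(target=consumer)
worker.start()
for packet_id in range(20):
    queue.push(packet_id)   # blocks whenever 4 items are already queued
queue.push(STOP)
worker.join()
print(received == list(range(20)))
```

The timeout on `pop` mirrors the FP loop's need to wake up periodically and check whether it should keep running.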
Each FP thread runs this loop independently:
while (running) {
pkt = input_queue.pop(timeout=100ms)
if (!pkt) continue // timeout, check if still running
flow = flows_[pkt.tuple] // O(1) hash lookup
if (!flow.classified):
try SNI extraction // TLS port 443
try HTTP host extraction // HTTP port 80
try DNS detection // port 53
if (!flow.blocked):
flow.blocked = rules.check(src_ip, app, sni)
if (flow.blocked):
stats.dropped++
else:
output_queue.push(pkt)
stats.forwarded++
if (TCP FIN or RST):
exportFlow(flow) // write to flows.csv
flows_.erase(tuple) // cleanup
}
Reads binary PCAP files in two steps:
- Global Header (24 bytes): Magic number (validates file), version, timestamp precision, max packet size, link type
- Per-Packet: 16-byte header (timestamps + lengths) followed by raw packet data
struct PcapGlobalHeader {
uint32_t magic_number; // 0xa1b2c3d4 = valid PCAP
uint16_t version_major; // usually 2
uint16_t version_minor; // usually 4
int32_t thiszone; // timezone (usually 0)
uint32_t sigfigs; // timestamp accuracy (usually 0)
uint32_t snaplen; // max packet size (usually 65535)
uint32_t network; // link type (1 = Ethernet)
};
Parses raw bytes into structured data with manual bit manipulation:
// Extract IP version and header length from first byte
uint8_t ip_byte = data[14];
uint8_t version = (ip_byte >> 4) & 0x0F; // top 4 bits
uint8_t ihl = ip_byte & 0x0F; // bottom 4 bits
size_t ip_hdr_len = ihl * 4; // IHL is in 32-bit words
Manually parses TLS binary format byte-by-byte:
TLS Record [byte 0]: 0x16 = Content Type: Handshake
TLS Record [bytes 1-2]: 0x03 0x01 = Version (TLS 1.0 compat)
TLS Record [bytes 3-4]: Length of handshake data
Handshake [byte 5]: 0x01 = Client Hello
Handshake [bytes 6-8]: 3-byte length
Client Hello [bytes 9-10]: Client version
Client Hello [bytes 11-42]: Random (32 bytes)
Client Hello [byte 43]: Session ID length
... (cipher suites, compression) ...
Extensions length (2 bytes)
Extension type 0x0000 = SNI
Extension data length (2 bytes)
SNI list length (2 bytes)
SNI entry type 0x00 = hostname
SNI name length (2 bytes)
SNI name bytes ← THIS IS THE DOMAIN NAME
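The byte layout above can be walked in a few lines of Python. This is a sketch under simplifying assumptions: the whole Client Hello sits in one TLS record and one TCP segment. It is not the SNIExtractor implementation, and `build_client_hello` is a hypothetical helper that exists only to produce test input.

```python
import struct
from typing import Optional


def extract_sni(payload: bytes) -> Optional[str]:
    """Return the SNI hostname from a TLS Client Hello, or None."""
    try:
        if payload[0] != 0x16 or payload[5] != 0x01:  # handshake record, Client Hello
            return None
        pos = 9 + 2 + 32                  # record + handshake headers, version, random
        pos += 1 + payload[pos]           # session ID
        (cipher_len,) = struct.unpack_from("!H", payload, pos)
        pos += 2 + cipher_len             # cipher suites
        pos += 1 + payload[pos]           # compression methods
        (ext_total,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        end = min(pos + ext_total, len(payload))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", payload, pos)
            pos += 4
            if ext_type == 0x0000:        # server_name extension
                # list length (2 bytes), entry type (1), name length (2), name
                if payload[pos + 2] != 0x00:
                    return None
                (name_len,) = struct.unpack_from("!H", payload, pos + 3)
                name = payload[pos + 5:pos + 5 + name_len]
                return name.decode("ascii") if len(name) == name_len else None
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None                       # truncated or malformed input
    return None


def build_client_hello(host: str) -> bytes:
    """Build a minimal Client Hello carrying one SNI entry (test input only)."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    sni_data = struct.pack("!H", len(entry)) + entry
    extensions = (struct.pack("!HH", 0x000A, 4) + b"\x00\x02\x00\x1d"   # unrelated extension
                  + struct.pack("!HH", 0x0000, len(sni_data)) + sni_data)
    body = (b"\x03\x03" + b"\x00" * 32 + b"\x00"          # version, random, empty session ID
            + struct.pack("!H", 2) + b"\x13\x01"          # one cipher suite
            + b"\x01\x00"                                 # null compression
            + struct.pack("!H", len(extensions)) + extensions)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


hello = build_client_hello("www.youtube.com")
print(extract_sni(hello))
```

Every length field is attacker-controlled, so the sketch treats any out-of-range read as "no SNI" rather than trusting the declared lengths.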
Thread-safe CSV writer that captures per-flow statistics:
void exportFlow(const Connection& conn) {
    // Compute derived metrics, guarding against empty and zero-duration flows
    uint64_t duration_ms = last_seen_ms - first_seen_ms;
    double duration_s = duration_ms / 1000.0;
    double avg_pkt_size = total_packets > 0 ? static_cast<double>(total_bytes) / total_packets : 0.0;
    double pkts_per_sec = duration_s > 0 ? total_packets / duration_s : 0.0;
    double bytes_per_sec = duration_s > 0 ? total_bytes / duration_s : 0.0;
    // Thread-safe write
    std::lock_guard<std::mutex> lock(mutex_);
    file_ << row_string << "\n";
}
Your browser                    YouTube Server
│ │
│──── TLS Client Hello ────────►│
│ ┌───────────────────────┐ │
│ │ SNI: www.youtube.com │ │ ← DPI reads this
│ │ (PLAINTEXT!) │ │
│ └───────────────────────┘ │
│◄─── TLS Server Hello ─────────│
│ (certificate + params) │
│ │
│════ Encrypted from here ══════│
Why SNI is plaintext: The server needs to know which domain you're connecting to BEFORE encryption is established — to pick the right TLS certificate. This is by design in the TLS protocol.
What we extract:
| Traffic Type | Method | Field Extracted |
|---|---|---|
| HTTPS (port 443) | TLS ClientHello parsing | SNI hostname |
| HTTP (port 80) | HTTP header parsing | Host: header |
| DNS (port 53) | UDP payload detection | Classified as DNS |
| Other | Port-based heuristic | App type guess |
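For the plain-HTTP row of the table, a Host header lookup might look like the sketch below. It assumes the request headers fit in a single packet, and `extract_http_host` is a hypothetical name; the engine's own HTTP parsing is in C++.

```python
from typing import Optional

HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ", b"PATCH ")


def extract_http_host(payload: bytes) -> Optional[str]:
    """Return the lower-cased Host header value of an HTTP/1.x request, or None."""
    if not payload.startswith(HTTP_METHODS):
        return None
    head, _, _ = payload.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n")[1:]:          # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":   # header names are case-insensitive
            return value.strip().decode("ascii", errors="replace").lower()
    return None


request = b"GET /index.html HTTP/1.1\r\nUser-Agent: demo\r\nHost: www.Example.com\r\n\r\n"
print(extract_http_host(request))
```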
Packet arrives
│
▼
┌─────────────────────────────────┐
│ Is source IP in blocked list? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
┌─────────────────────────────────┐
│ Is app type in blocked list? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
┌─────────────────────────────────┐
│ Does SNI match blocked domain? │──Yes──► DROP
└───────────────┬─────────────────┘
│No
▼
FORWARD
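The three checks in the flowchart can be expressed as a short Python sketch. The `Rules` class is hypothetical, and substring matching for domains is an assumption inferred from the `--block-domain facebook` example rather than confirmed behaviour of the rule manager.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Rules:
    blocked_ips: Set[str] = field(default_factory=set)
    blocked_apps: Set[str] = field(default_factory=set)
    blocked_domains: Set[str] = field(default_factory=set)  # matched as substrings (assumed)

    def is_blocked(self, src_ip: str, app: Optional[str], sni: Optional[str]) -> bool:
        """Apply the checks in flowchart order: source IP, then app, then domain."""
        if src_ip in self.blocked_ips:
            return True
        if app is not None and app in self.blocked_apps:
            return True
        if sni is not None and any(d in sni.lower() for d in self.blocked_domains):
            return True
        return False


rules = Rules(
    blocked_ips={"192.168.1.50"},
    blocked_apps={"YouTube"},
    blocked_domains={"facebook"},
)
print(rules.is_blocked("192.168.1.50", None, None))                    # blocked by IP
print(rules.is_blocked("192.168.1.10", "YouTube", "www.youtube.com"))  # blocked by app
print(rules.is_blocked("192.168.1.10", "Google", "www.google.com"))    # forwarded
```

Checking the source IP first means a blocked host is dropped before any payload inspection is needed.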
Connection to YouTube:
Packet 1 (SYN) → No SNI yet, FORWARD
Packet 2 (SYN-ACK) → No SNI yet, FORWARD
Packet 3 (ACK) → No SNI yet, FORWARD
Packet 4 (Client Hello) → SNI: www.youtube.com
→ App: YOUTUBE (blocked!)
→ Mark flow as BLOCKED
→ DROP this packet
Packet 5 (Data) → Flow is BLOCKED → DROP
Packet 6 (Data) → Flow is BLOCKED → DROP
The app cannot be identified until the Client Hello arrives, so the handshake packets before it are forwarded. Once a flow is marked as blocked, all of its later packets are dropped without further inspection.
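The six-packet trace above can be reproduced with a small Python sketch of per-flow state. The `SNI_TO_APP` mapping and the `process` function are hypothetical simplifications: each packet is reduced to a flow key plus the SNI it carries, if any.

```python
SNI_TO_APP = {  # illustrative mapping, not the engine's full table
    "www.youtube.com": "YouTube",
    "www.facebook.com": "Facebook",
}


def process(packets, blocked_apps):
    """Return a FORWARD/DROP verdict per packet, remembering decisions per flow."""
    flows = {}
    verdicts = []
    for flow_key, sni in packets:
        state = flows.setdefault(flow_key, {"app": None, "blocked": False})
        if state["app"] is None and sni is not None:
            # First packet that reveals the SNI: classify once, then reuse
            state["app"] = SNI_TO_APP.get(sni, "Unknown")
            state["blocked"] = state["app"] in blocked_apps
        verdicts.append("DROP" if state["blocked"] else "FORWARD")
    return verdicts


flow = ("192.168.1.100", "172.217.14.206", 54321, 443, 6)
trace = [
    (flow, None),               # SYN
    (flow, None),               # SYN-ACK
    (flow, None),               # ACK
    (flow, "www.youtube.com"),  # Client Hello
    (flow, None),               # Data
    (flow, None),               # Data
]
verdicts = process(trace, blocked_apps={"YouTube"})
print(verdicts)
```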
flows.csv (from C++ engine)
│
▼
┌─────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING │
│ log transforms · protocol encoding · port categories │
│ bytes/pkt ratio · pps · bps · duration │
└────────────────────────┬────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Random Forest │ │ Isolation Forest │
│ Classifier │ │ Anomaly Detector │
│ │ │ │
│ Supervised │ │ Unsupervised │
│ 13 classes │ │ No attack labels │
│ 98.95% accuracy │ │ 0.625 ROC-AUC │
└────────┬─────────┘ └──────────┬───────────┘
│ │
└──────────────┬────────────────┘
▼
┌──────────────────┐
│ SHAP Explainer │
│ "Why flagged?" │
└──────────────────┘
| Feature | Description | Why Important |
|---|---|---|
| protocol_enc | TCP=0, UDP=1, ICMP=2 | Protocol type is discriminative |
| dst_port | Destination port number | Port 443/80/53 are strong signals |
| duration_ms | Flow duration in ms | Attack flows often short/long |
| total_packets | Packet count | DoS/DDoS have extreme counts |
| total_bytes | Byte count | Streaming vs scanning patterns |
| avg_packet_size | Bytes per packet | Small = scan, Large = stream |
| packets_per_second | Flow rate | BruteForce has high pps |
| bytes_per_second | Bandwidth | DDoS has extreme bps |
| log_* variants | Log-transformed versions | Compress skewed distributions |
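One way these features could be derived from a single flow record is sketched below. The exact set of log_* columns and the use of log1p are assumptions, since the table does not specify them, and `build_features` is a hypothetical helper rather than the code in ml/train_classifier.py.

```python
import math

PROTOCOL_CODES = {"TCP": 0, "UDP": 1, "ICMP": 2}
LOG_COLUMNS = (
    "duration_ms", "total_packets", "total_bytes",
    "avg_packet_size", "packets_per_second", "bytes_per_second",
)


def build_features(flow: dict) -> dict:
    """Turn raw flow counters into the model features listed in the table."""
    packets = flow["total_packets"]
    total_bytes = flow["total_bytes"]
    duration_s = flow["duration_ms"] / 1000.0
    features = {
        "protocol_enc": PROTOCOL_CODES.get(flow["protocol"], -1),
        "dst_port": flow["dst_port"],
        "duration_ms": flow["duration_ms"],
        "total_packets": packets,
        "total_bytes": total_bytes,
        # Guard against empty or zero-duration flows
        "avg_packet_size": total_bytes / packets if packets else 0.0,
        "packets_per_second": packets / duration_s if duration_s > 0 else 0.0,
        "bytes_per_second": total_bytes / duration_s if duration_s > 0 else 0.0,
    }
    for name in LOG_COLUMNS:
        features["log_" + name] = math.log1p(features[name])  # log(1 + x) is defined at 0
    return features


features = build_features({
    "protocol": "TCP", "dst_port": 443,
    "duration_ms": 5000, "total_packets": 20, "total_bytes": 30000,
})
print(features["avg_packet_size"], features["packets_per_second"], features["bytes_per_second"])
```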
# Step 1: Generate training data
./dpi_engine test_dpi.pcap output.pcap
# Step 2: Train the classifier
python ml/train_classifier.py
# Step 3: Run anomaly detection
python ml/anomaly_detector.py
# Step 4: Explain anomalies with SHAP
python ml/explain_anomaly.py --top 5
# Step 5: Classify a single flow
python ml/predict.py --dst-port 443 --packets 20 --bytes 30000 --duration 5000
# Step 6: CIC-IDS-2017 benchmark (needs dataset)
python ml/preprocess_cicids.py --input data/cicids2017/MachineLearningCVE/
python ml/evaluate_cicids.py --data flows_cicids.csv
python ml/evaluate_anomaly_cicids.py --data flows_cicids.csv
Validated on CIC-IDS-2017, a public benchmark dataset containing 2,830,743 real network flows with labelled benign and attack traffic.
| Metric | Score |
|---|---|
| Test Accuracy | 98.95% |
| Macro F1 Score | 84.73% |
| Weighted F1 | 99% |
| 5-Fold CV Macro-F1 | 81.07% ± 2.38% |
| Training Flows | 2,264,594 |
| Test Flows | 566,149 |
| Classes | 13 |
Per-Class Performance:
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| DDoS | 100% | 100% | 100% | 25,606 |
| DNS | 100% | 100% | 100% | 191,563 |
| PortScan | 99% | 100% | 100% | 31,786 |
| HTTPS | 100% | 99% | 99% | 213,002 |
| DoS | 99% | 99% | 99% | 50,532 |
| SSH | 99% | 99% | 99% | 2,160 |
| SMTP | 98% | 99% | 98% | 756 |
| HTTP | 99% | 94% | 96% | 47,139 |
| BruteForce | 90% | 92% | 91% | 3,068 |
| Heartbleed | 100% | 100% | 100% | 2 |
| Infiltration | 100% | 71% | 83% | 7 |
| Botnet | 16% | 93% | 27%* | 393 |
| WebAttack | 5% | 78% | 9%* | 135 |
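* Low precision is due to severe class imbalance: Botnet and WebAttack together make up under 0.1% of the test flows (528 of 566,149), so even a small number of misclassified flows from the large classes outweighs the true positives. This is a known limitation that SMOTE oversampling is intended to address (see Future Improvements).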
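The gap between accuracy and Macro F1 follows directly from how the averages are taken. As a quick check, the sketch below recomputes both averages from the rounded per-class F1 values and supports in the table. Because the inputs are rounded, the results only approximate the reported 84.73% and 99%.

```python
# (class, rounded F1 in percent, support) copied from the per-class table
per_class = [
    ("DDoS", 100, 25606), ("DNS", 100, 191563), ("PortScan", 100, 31786),
    ("HTTPS", 99, 213002), ("DoS", 99, 50532), ("SSH", 99, 2160),
    ("SMTP", 98, 756), ("HTTP", 96, 47139), ("BruteForce", 91, 3068),
    ("Heartbleed", 100, 2), ("Infiltration", 83, 7),
    ("Botnet", 27, 393), ("WebAttack", 9, 135),
]

total_support = sum(support for _, _, support in per_class)

# Macro average: every class counts equally, so two weak rare classes pull it down
macro_f1 = sum(f1 for _, f1, _ in per_class) / len(per_class)

# Weighted average: classes count in proportion to their support
weighted_f1 = sum(f1 * support for _, f1, support in per_class) / total_support

print(total_support)
print(round(macro_f1, 1), round(weighted_f1, 1))
```

Botnet and WebAttack contribute 2 of 13 terms to the macro average but only 528 of 566,149 flows to the weighted one, which is why the two figures differ by roughly 14 points.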
Trained with zero attack labels. Evaluated post-hoc against ground-truth CIC-IDS-2017 labels.
| Metric | Score |
|---|---|
| ROC-AUC | 0.6254 |
| Detection Rate (Recall) | 21.2% |
| False Alarm Rate | 19.7% |
| Precision | 20.8% |
| Attack Rate in Dataset | 19.7% |
| Contamination Parameter | 0.20 |
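The figures in this table are linked by simple arithmetic, which the following sketch checks. It derives precision and the fraction of flagged flows from the reported recall, false alarm rate, and attack rate. The function name is illustrative, and small differences from the table come from rounding of the inputs.

```python
def precision_from_rates(recall: float, false_alarm_rate: float, attack_rate: float) -> float:
    """Precision = TP / (TP + FP), with both expressed as fractions of all flows."""
    true_positives = attack_rate * recall
    false_positives = (1.0 - attack_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)


recall = 0.212            # Detection Rate
false_alarm_rate = 0.197  # False Alarm Rate
attack_rate = 0.197       # Attack Rate in Dataset

precision = precision_from_rates(recall, false_alarm_rate, attack_rate)

# Fraction of all flows flagged as anomalous; this should track the contamination setting
flagged = attack_rate * recall + (1.0 - attack_rate) * false_alarm_rate

print(round(precision, 3), round(flagged, 3))
```

The flagged fraction comes out at about 0.20, matching the contamination parameter, and precision lands close to the attack rate itself, which is consistent with the modest ROC-AUC.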
Per-Attack Detection Rate:
| Attack Type | Detection Rate | Notes |
|---|---|---|
| Heartbleed | 100% | Highly anomalous packet structure |
| Infiltration | 91.7% | Rare + unusual behavior pattern |
| BruteForce | 46.6% | High packet rate detectable |
| DDoS | 46.1% | Volume anomaly detectable |
| DoS | 20.0% | Blends with high-traffic flows |
| PortScan | 0.7% | Designed to mimic normal TCP |
| Botnet | 3.0% | Specifically evades detection |
| WebAttack | 2.7% | Looks like normal HTTP traffic |
The Two-Layer Security Story:
Supervised (RF) → Known attacks → 98.95% accuracy ← "We know what to look for"
Unsupervised (IF) → Unknown attacks → 0.625 ROC-AUC ← "We can find new threats too"
This mirrors real-world IDS architectures: a supervised layer for known attack signatures and an anomaly layer for zero-day threats.
| Page | Description |
|---|---|
| Overview | Total flows, packets, bytes, blocked %, top apps bar chart, protocol split |
| Model Performance | CIC-IDS-2017 metrics, per-class F1 chart, confusion table |
| Anomaly Detector | Detection rates by attack type, score distribution |
| Traffic Map | Filterable flow table, top source IPs |
| Anomaly Detection | Score histogram, scatter plot, top anomalous flows |
| App Breakdown | Traffic share pie, bandwidth pie, duration box plots |
| AI Analyst | Natural language queries powered by Groq + Llama 3.1 70B |
"What was the most suspicious traffic in the last 10 minutes?"
"Which IPs were blocked and why?"
"Is there any sign of a port scan or DDoS?"
"Which app generated the most bandwidth?"
"Summarise all anomalies found."
Use generated/test captures and rerun the pipeline to refresh dashboard inputs:
python generate_test_pcap.py
./dpi_engine test_dpi.pcap output.pcap
python ml/anomaly_detector.py
streamlit run dashboard/app.py
- Windows: MSYS2 with g++ | macOS/Linux: g++ or clang++
- Python 3.11+ for the ML pipeline
- No external C++ libraries required
All commands run from the project root (Packet_analyzer/).
cmake -S . -B build
cmake --build build --config Release
Generated binaries include packet_analyzer, dpi_working, dpi_engine, and dpi_mt.
# Basic (main multi-threaded pipeline)
./dpi_engine test_dpi.pcap output.pcap
# Block a single app
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube
# Block multiple apps + IP + domain
./dpi_engine test_dpi.pcap output.pcap --block-app YouTube --block-app TikTok --block-ip 192.168.1.50 --block-domain facebook
# Configure thread count
./dpi_engine input.pcap output.pcap --lbs 4 --fps 4
pip install -r requirements.txt
# Train classifier
python ml/train_classifier.py
# Run anomaly detection
python ml/anomaly_detector.py
# Launch dashboard
streamlit run dashboard/app.py
Create a .env file in the project root (never commit this):
GROQ_API_KEY=gsk_your_key_here
Get a free Groq API key at console.groq.com.
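How dashboard/llm_analyst.py actually loads the key is not shown here. As a hedged sketch, a standard-library-only loader could prefer an exported environment variable and fall back to the .env file. Both function names are hypothetical.

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")  # drop optional quotes
    return values


def get_groq_api_key(env_path: str = ".env") -> str:
    """Return GROQ_API_KEY from the environment, else from the .env file."""
    key = os.environ.get("GROQ_API_KEY")
    if not key and os.path.isfile(env_path):
        with open(env_path, encoding="utf-8") as handle:
            key = parse_env_text(handle.read()).get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; export it or add it to .env")
    return key


print(parse_env_text("# local secrets\nGROQ_API_KEY=gsk_your_key_here\n"))
```

Failing fast with a clear error is usually friendlier than letting the first API call fail with an authentication error.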
python generate_test_pcap.py
The repository intentionally excludes:
- large datasets
- generated artifacts
- trained model binaries
- PCAP captures
to comply with GitHub storage limits and maintain fast cloning/setup times.
Recommended workflow:
git clone <repo>
pip install -r requirements.txt
cmake -S . -B build
cmake --build build --config Release
Then run the pipeline locally using your own datasets or generated traffic.
╔══════════════════════════════════════════════════════════════╗
║ DPI ENGINE v2.0 (Multi-threaded) ║
╠══════════════════════════════════════════════════════════════╣
║ Load Balancers: 2 FPs per LB: 2 Total FPs: 4 ║
╚══════════════════════════════════════════════════════════════╝
[Rules] Blocked app: YouTube
[Rules] Blocked IP: 192.168.1.50
[Reader] Processing packets...
[Reader] Done reading 77 packets
╔══════════════════════════════════════════════════════════════╗
║ PROCESSING REPORT ║
╠══════════════════════════════════════════════════════════════╣
║ Total Packets: 77 ║
║ Total Bytes: 5738 ║
║ TCP Packets: 73 ║
║ UDP Packets: 4 ║
╠══════════════════════════════════════════════════════════════╣
║ Forwarded: 69 ║
║ Dropped: 8 ║
╠══════════════════════════════════════════════════════════════╣
║ THREAD STATISTICS ║
║ LB0 dispatched: 53 ║
║ FP0 processed: 53 ║
║ FP3 processed: 24 ║
╠══════════════════════════════════════════════════════════════╣
║ APPLICATION BREAKDOWN ║
╠══════════════════════════════════════════════════════════════╣
║ HTTPS 39 50.6% ########## ║
║ YouTube 4 5.2% # (BLOCKED) ║
║ DNS 4 5.2% # ║
╚══════════════════════════════════════════════════════════════╝
[Detected Domains/SNIs]
- www.youtube.com -> YouTube (BLOCKED)
- www.facebook.com -> Facebook
- www.google.com -> Google
| Section | Meaning |
|---|---|
| Configuration | Thread pool size |
| Rules | Active blocking rules |
| Total Packets | Read from input PCAP |
| Forwarded | Written to output PCAP |
| Dropped | Blocked by rules |
| Thread Statistics | Work distribution across LB/FP threads |
| Application Breakdown | Traffic classification by app |
| Detected SNIs | Domain names extracted from TLS handshakes |
| Item | Description |
|---|---|
| SMOTE Oversampling | Address Botnet/WebAttack class imbalance, with a target Macro-F1 above 92% |
| Live libpcap | Capture from real network interface instead of PCAP files |
| Kubernetes Scaling | Horizontal scaling with k8s deployments for high-throughput environments |
| Grafana Integration | Replace Streamlit with production Grafana + InfluxDB dashboards |
| Item | Description |
|---|---|
| Kafka Streaming | Replace file-based flow export with real-time Kafka topic pipeline |
| SIEM Integration | Export alerts to Splunk / Elastic SIEM via CEF/syslog format |
| QUIC/HTTP3 Support | Detect traffic on UDP port 443 with QUIC Initial packet parsing |
| Distributed Processing | Multi-node packet processing with shared flow state via Redis |
| Item | Description |
|---|---|
| Transformer Traffic Analysis | Replace Random Forest with FlowTransformer / ET-BERT for sequence modeling |
| Online Learning | Incremental model updates without full retraining (River ML) |
| GPU Acceleration | CUDA-based packet parsing for 10Gbps+ line rate processing |
| Federated Learning | Train across multiple network nodes without centralising raw traffic |
| Cloud Deployment | AWS/GCP managed deployment with auto-scaling and CloudWatch metrics |
This platform demonstrates a complete AI/cybersecurity engineering stack:
| Skill | Implementation |
|---|---|
| Systems Programming | C++17 multi-threaded packet engine, manual binary parsing |
| Network Protocols | Ethernet/IP/TCP/UDP/TLS parsing, SNI extraction |
| Concurrent Programming | Thread pools, thread-safe queues (mutex + condition variables), atomic counters |
| ML Engineering | Feature engineering, Random Forest, Isolation Forest, cross-validation |
| AI Explainability | SHAP values for anomaly explanation |
| LLM Integration | Groq API + Llama 3.1 70B for natural language analytics |
| Data Engineering | C++ → CSV → Python bridge, real-world benchmark evaluation |
| Full-Stack | Streamlit dashboard with 5 interactive pages |
| DevOps | CMake-based build workflow |
| Research Validation | CIC-IDS-2017 benchmark, 2.83M flows, published metrics |
The key insight: even HTTPS traffic leaks the destination domain in the TLS handshake, enabling identification and control of application usage — and that signal, combined with flow-level statistics, is powerful enough to train production-grade ML models.
Built with C++17 · Python 3.11 · scikit-learn · Streamlit · Groq
If this project helped you, consider giving it a ⭐





