An internet-scale threat reconnaissance and infrastructure graphing pipeline. This stack discovers domains from global Certificate Transparency logs in near real time, resolves them, fingerprints the underlying infrastructure, and builds a continuously updated relationship graph in Neo4j for cluster analysis.

      ┌──────────────────────────────┐
      │    Google CT Log Servers     │
      │   (Argon 2026h1/h2, Xenon)   │
      │      RFC 6962 HTTP API       │
      └──────────────┬───────────────┘
                     │  HTTP polling (256 entries/batch)
                     │
             ┌───────▼───────┐
             │ ct_stream.py  │  Parse X.509 certs → extract domains
             └───────┬───────┘
                     │  LPUSH domain names
                     │
             ┌───────▼───────┐
             │  Redis Queue  │  Key: "targets" (list)
             └───────┬───────┘
                     │  BRPOP (blocking pop)
         ┌───────────┼───────────┐
         │           │           │
    ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
    │Worker 1 │ │Worker 2 │ │Worker 3 │  (parallel multiprocessing)
    └────┬────┘ └────┬────┘ └────┬────┘
         │           │           │
         │  For each target:     │
         │  ├─ DNS Resolution    │
         │  ├─ WHOIS → ASN       │
         │  ├─ TLS Handshake     │
         │  ├─ Favicon mmh3 Hash │
         │  ├─ HTTP Headers      │
         │  └─ Nmap Top 20 Ports │
         │           │           │
         └───────────┼───────────┘
                     │  MERGE (upsert)
             ┌───────▼───────┐
             │     Neo4j     │  Graph database
             │  (bolt:7687)  │  UI at :7474
             └───────────────┘

`main.py` is the only script you run. It uses Python's `multiprocessing` module to spawn:
- 1 CT Stream process — polls Certificate Transparency logs
- 3 Worker processes — consume the Redis queue in parallel
All processes are managed together. Ctrl+C triggers a clean shutdown of every child process.
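A minimal sketch of this supervisor pattern, not the actual `main.py`. The names `ct_stream_main`, `worker_main`, `build_processes`, and `supervise` are placeholders invented for illustration; the real entry points live in `ct_stream.py` and `worker.py`.

```python
import multiprocessing as mp
import time


def ct_stream_main():
    """Placeholder for the CT poller loop (hypothetical name)."""
    while True:
        time.sleep(1)


def worker_main(worker_id):
    """Placeholder for a recon worker loop (hypothetical name)."""
    while True:
        time.sleep(1)


def build_processes(worker_count=3):
    """Create, but do not start, one CT stream process and N workers."""
    procs = [mp.Process(target=ct_stream_main, name="ct-stream", daemon=True)]
    for i in range(1, worker_count + 1):
        procs.append(
            mp.Process(target=worker_main, args=(i,), name=f"worker-{i}", daemon=True)
        )
    return procs


def supervise(procs):
    """Start every child, wait on them, and stop them all on Ctrl+C."""
    for p in procs:
        p.start()
    try:
        for p in procs:
            p.join()
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()
        for p in procs:
            p.join()
```

Calling `supervise(build_processes())` under an `if __name__ == "__main__":` guard would start the whole stack; the guard matters on platforms where `multiprocessing` re-imports the main module in each child.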
Instead of relying on third-party websocket services (like certstream, which is frequently down), we poll Google's CT log servers directly using the RFC 6962 HTTP API.
Certificate Transparency is a global, append-only public ledger of every TLS certificate issued by a trusted Certificate Authority. Every CA (Let's Encrypt, DigiCert, Cloudflare, etc.) is required to submit certificates here before browsers will trust them.
- `get-sth` (Signed Tree Head): asks the log server "how many total entries exist?" This returns the `tree_size`.
- Start position: we begin `BACKLOG` (500) entries before the tip, giving us an immediate batch of data to process.
- `get-entries`: fetches up to 256 raw certificate entries per request.
- Parse X.509: each entry contains a DER-encoded certificate. We decode it using the `cryptography` library and extract:
  - SAN (Subject Alternative Name): the list of all domains the cert covers
  - CN (Common Name): fallback if no SAN exists
- Strip wildcards: `*.example.com` becomes `example.com`
- Push to Redis: each domain is `LPUSH`ed into the `"targets"` list for workers to consume.
- Loop: once caught up, the poller checks for new entries every 0.5 seconds.
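The polling steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `ct_stream.py`: the helper names are invented, `log_url` is whatever base URL you configure (without a trailing slash), and the X.509 parsing itself (SAN/CN extraction with `cryptography`) is left out. The leaf layout follows the `MerkleTreeLeaf` structure in RFC 6962.

```python
import base64
import json
import urllib.request

BACKLOG = 500      # entries to rewind from the tip (value from the text)
BATCH_SIZE = 256   # max entries per get-entries request (value from the text)


def get_tree_size(log_url):
    """Call get-sth and return the log's current tree_size."""
    with urllib.request.urlopen(f"{log_url}/ct/v1/get-sth", timeout=10) as resp:
        return json.load(resp)["tree_size"]


def get_entries(log_url, start, end):
    """Call get-entries for the inclusive index range [start, end]."""
    url = f"{log_url}/ct/v1/get-entries?start={start}&end={end}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["entries"]


def start_index(tree_size, backlog=BACKLOG):
    """Begin `backlog` entries before the tip, never below zero."""
    return max(0, tree_size - backlog)


def batch_ranges(start, tree_size, batch=BATCH_SIZE):
    """Yield inclusive (start, end) pairs covering [start, tree_size).

    A log may return fewer entries than requested, so a real poller
    should advance by the number of entries it actually received.
    """
    while start < tree_size:
        end = min(start + batch, tree_size) - 1
        yield start, end
        start = end + 1


def leaf_certificate(leaf_input_b64):
    """Return (entry_type, der_bytes) from a base64 MerkleTreeLeaf."""
    leaf = base64.b64decode(leaf_input_b64)
    # Layout: version(1) leaf_type(1) timestamp(8) entry_type(2) ...
    entry_type = int.from_bytes(leaf[10:12], "big")
    offset = 12
    if entry_type == 1:      # precert entry: skip the 32-byte issuer_key_hash
        offset += 32
    length = int.from_bytes(leaf[offset:offset + 3], "big")
    return entry_type, leaf[offset + 3:offset + 3 + length]


def strip_wildcard(name):
    """Turn '*.example.com' into 'example.com'; leave other names alone."""
    return name[2:] if name.startswith("*.") else name
```

For entry type 0 the returned bytes are a full DER certificate; for type 1 they are a TBSCertificate, which needs slightly different handling when decoding.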
Three CT log endpoints are configured. If one fails, the poller automatically falls through to the next:
- Google Argon 2025h1 (US)
- Google Argon 2025h2 (US)
- Google Xenon 2025h1 (EU)
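The fall-through behaviour can be sketched as a small helper. `fetch_with_failover` is a hypothetical name, and `fetch` stands for any callable that takes an endpoint and raises on failure (for example, a `get-sth` request).

```python
def fetch_with_failover(endpoints, fetch):
    """Try each endpoint in order; return (endpoint, result) from the first that works."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint)
        except Exception as exc:  # any failure moves on to the next log
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all CT endpoints failed: " + "; ".join(errors))
```

Catching the broad `Exception` is a deliberate simplification here: timeouts, HTTP errors, and malformed JSON all lead to the same decision, which is to try the next log.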
Each worker process runs an infinite loop, blocking on BRPOP until a target appears in Redis. When a target arrives, the worker runs 6 recon modules against it:
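A rough sketch of that consume loop, assuming `queue` behaves like a `redis.Redis` client (whose `brpop` returns a `(key, value)` pair, or `None` on timeout). The function name, the `modules` list, and the `max_items` parameter are illustrative; `max_items` exists only so the sketch can terminate.

```python
QUEUE_KEY = "targets"


def worker_loop(queue, modules, max_items=None, timeout=5):
    """Pop targets with BRPOP and run every recon module on each one."""
    handled = 0
    while max_items is None or handled < max_items:
        item = queue.brpop(QUEUE_KEY, timeout=timeout)
        if item is None:              # timed out on an empty queue; wait again
            continue
        _key, raw = item
        target = raw.decode() if isinstance(raw, bytes) else raw
        for module in modules:
            try:
                module(target)
            except Exception as exc:  # one failing module should not kill the worker
                print(f"[{getattr(module, '__name__', module)}] {target}: {exc}")
        handled += 1
    return handled
```

Because the producer uses `LPUSH` and the consumer uses `BRPOP`, the list behaves as a FIFO queue: the oldest target is scanned first.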
DNS Resolution: Domain → IP address
If the target from the CT stream is a domain name (not an IP), the worker resolves it via `socket.gethostbyname()`. The `(domain) -[resolves_to]→ (ip)` relationship is written to Neo4j. If resolution fails, the target is silently dropped.
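A small sketch of this step. `is_ip` and `resolve` are illustrative names; note that `socket.gethostbyname()` returns a single IPv4 address only.

```python
import ipaddress
import socket


def is_ip(target):
    """True if `target` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(target)
        return True
    except ValueError:
        return False


def resolve(target):
    """Return an IP for `target`, or None when resolution fails."""
    if is_ip(target):
        return target                      # already an IP, nothing to resolve
    try:
        return socket.gethostbyname(target)
    except (OSError, UnicodeError):        # socket.gaierror is an OSError
        return None
```

Returning `None` lets the caller drop the target quietly, matching the behaviour described above.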
WHOIS: IP → Autonomous System Number + Hosting Provider
Runs the system `whois` command against the IP and parses the output. It extracts:
- The ASN (`origin` field) identifying the network operator.
- The Organization / Hosting Provider (`OrgName`, `Organization`, `netname`, or `descr` field) identifying the specific host (e.g., `Amazon Technologies Inc.`, `Google LLC`).
Graph edges:
- `(ip) -[hosted_on]→ (asn)`
- `(ip) -[hosted_by]→ (org)`
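The parsing step might look roughly like this. It is a sketch under assumptions: the function names are invented, the first matching field wins, and only the field value is returned (the example graph values elsewhere in this README suggest the real worker may keep the whole `origin: AS…` line).

```python
import subprocess

ORG_FIELDS = ("orgname", "organization", "netname", "descr")


def run_whois(ip, timeout=30):
    """Run the system whois client and return its text output."""
    proc = subprocess.run(["whois", ip], capture_output=True, text=True, timeout=timeout)
    return proc.stdout


def parse_whois(text):
    """Return (asn, org) taken from the first matching fields, else None."""
    asn = None
    org = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if not value:
            continue
        if key == "origin" and asn is None:
            asn = value
        elif key in ORG_FIELDS and org is None:
            org = value
    return asn, org
```

WHOIS output is not standardized across registries, so any parser like this is best-effort and should tolerate missing fields.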
TLS Handshake: IP:443 → cipher suite + certificate issuer
Opens a raw SSL socket to port 443, performs a TLS handshake, and extracts:
- The negotiated cipher suite (e.g., `TLS_AES_256_GCM_SHA384`)
- The certificate issuer chain
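Threat actors often reuse the same TLS configuration across their infrastructure. Because a handful of cipher suites dominate modern TLS, a shared suite is a weak signal on its own; it becomes a useful clustering signal when combined with other attributes such as the ASN or certificate issuer.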
Graph edge: (ip) -[tls_cipher]→ (cipher)
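A minimal handshake sketch using the standard `ssl` module. `tls_fingerprint` is an illustrative name. Certificate verification is disabled on purpose, since the goal is to observe the server rather than trust it; as a consequence the certificate is only available in DER form, and reading the issuer from it would need a parser such as `cryptography`.

```python
import socket
import ssl


def tls_fingerprint(ip, port=443, timeout=5.0):
    """Return (cipher_tuple, der_cert) for ip:port, or None on failure."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # connecting by IP, no hostname to verify
    ctx.verify_mode = ssl.CERT_NONE   # recon only: accept any certificate
    try:
        with socket.create_connection((ip, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw) as tls:
                # cipher() returns (name, protocol_version, secret_bits)
                return tls.cipher(), tls.getpeercert(binary_form=True)
    except OSError:                   # covers ssl.SSLError and timeouts
        return None
```

The tuple returned by `cipher()` has the same shape as the `tls` example value in the node table, e.g. `('TLS_AES_256_GCM_SHA384', ...)`.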
Favicon Hash: IP → mmh3 hash of `/favicon.ico`
Fetches http(s)://<ip>/favicon.ico and computes its MurmurHash3. This is the same technique used by Shodan for favicon-based infrastructure discovery. Identical favicon hashes across different IPs strongly suggest shared infrastructure or the same web application deployment.
Graph edge: (ip) -[favicon_hash]→ (hash)
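To show what the hash involves, here is a pure-Python stand-in for the 32-bit MurmurHash3 that the `mmh3` package provides (the real worker presumably just calls `mmh3`). The `favicon_hash` wrapper assumes the Shodan-style convention of hashing the base64-encoded bytes rather than the raw file; treat that as an assumption about this project.

```python
import base64

_MASK = 0xFFFFFFFF


def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & _MASK


def murmur3_32(data, seed=0):
    """Signed 32-bit MurmurHash3 (x86 variant) of a bytes object."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & _MASK
    n = len(data)
    body_end = n - (n % 4)
    for i in range(0, body_end, 4):            # full 4-byte blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl((k * c1) & _MASK, 15)
        k = (k * c2) & _MASK
        h = _rotl(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & _MASK
    tail = data[body_end:]
    if tail:                                   # remaining 1-3 bytes
        k = int.from_bytes(tail, "little")
        k = _rotl((k * c1) & _MASK, 15)
        h ^= (k * c2) & _MASK
    h ^= n                                     # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h


def favicon_hash(favicon_bytes):
    """Hash base64-encoded favicon bytes (assumed Shodan-style convention)."""
    return murmur3_32(base64.encodebytes(favicon_bytes))
```

The result is a signed integer, which is why favicon hashes seen in search tools are sometimes negative.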
HTTP Headers: IP → Server header value
Makes a simple GET / request and extracts the Server response header (e.g., nginx, cloudflare, Apache/2.4.41). This identifies the web server software.
Graph edge: (ip) -[server]→ (server_name)
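A sketch of the header grab with the standard `http.client` module, over plain HTTP only. `server_header` is an illustrative name, and the worker may well use a different HTTP client.

```python
import http.client


def server_header(host, port=80, timeout=5.0):
    """Send GET / and return the Server header, or None if unavailable."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().getheader("Server")
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

Keep in mind that the `Server` header is self-reported and trivially spoofed, so it is better treated as a hint than as proof of the software in use.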
Nmap: IP → open ports + service versions
Runs `nmap -Pn -sV --top-ports 20` to scan the 20 most common ports and detect the service version on each open one. The output is parsed so that only open ports are kept for clean display.
Note: Nmap is the slowest module by far (service detection can take 60 seconds or more per host). This is why we run 3 workers in parallel.
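One way to reduce nmap's normal text output to open ports only. The regular expression and function names are illustrative rather than taken from `worker.py`.

```python
import re
import subprocess

# Matches lines such as "80/tcp   open  http    nginx 1.18.0"
PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)\s*(.*)$")


def run_nmap(ip, timeout=120):
    """Run the scan described above and return nmap's text output."""
    cmd = ["nmap", "-Pn", "-sV", "--top-ports", "20", ip]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout


def open_ports(nmap_output):
    """Return [(port, proto, service, version)] for ports reported as open."""
    ports = []
    for line in nmap_output.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            port, proto, service, version = match.groups()
            ports.append((int(port), proto, service, version.strip()))
    return ports
```

Ports in the `closed`, `filtered`, or `open|filtered` states do not match the pattern and are dropped. For anything beyond display, nmap's XML output (`-oX`) is generally more robust to parse than the text format.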
`pivots.py` contains pre-built Cypher queries for finding infrastructure clusters:
- Shared Favicon Clusters — Find IPs that serve the same favicon (likely same operator)
- Shared ASN + TLS — Find IPs on the same network using the same TLS configuration (high confidence clustering)
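The queries in `pivots.py` are not reproduced here; the following is a guess at what the two pivots could look like, assuming the `Entity` / `REL` model and the relationship type names used in this README. The constant names are invented.

```python
# Hypothetical pivot queries, written against the (:Entity)-[:REL]->(:Entity) model.

SHARED_FAVICON = """
MATCH (ip:Entity {type: 'ip'})-[:REL {type: 'favicon_hash'}]->(f:Entity)
WITH f, collect(DISTINCT ip.value) AS ips
WHERE size(ips) > 1
RETURN f.value AS favicon, ips
ORDER BY size(ips) DESC
"""

SHARED_ASN_AND_TLS = """
MATCH (ip:Entity {type: 'ip'})-[:REL {type: 'hosted_on'}]->(asn:Entity),
      (ip)-[:REL {type: 'tls_cipher'}]->(tls:Entity)
WITH asn, tls, collect(DISTINCT ip.value) AS ips
WHERE size(ips) > 1
RETURN asn.value AS asn, tls.value AS tls, ips
ORDER BY size(ips) DESC
"""
```

Either string can be pasted into the Neo4j browser or passed to a driver session's `run()` method. The `size(ips) > 1` filter keeps only values shared by at least two IPs.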
`seed.py` pushes hardcoded IPs directly into the Redis queue, bypassing the CT stream entirely. Useful for targeted investigations.
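A sketch of the seeding idea, assuming `queue` is a `redis.Redis`-like client. The IPs below are documentation-range placeholders, not real targets, and `seed` is an illustrative function name.

```python
SEED_IPS = ["192.0.2.10", "198.51.100.25"]  # placeholder addresses


def seed(queue, ips=SEED_IPS, key="targets"):
    """LPUSH each IP onto the worker queue; return how many were pushed."""
    for ip in ips:
        queue.lpush(key, ip)
    return len(ips)
```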
All data is stored as nodes of a single type (`Entity`), connected by relationships:
(:Entity {value, type}) -[:REL {type}]→ (:Entity {value, type})
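Writing one edge with `MERGE` (so repeated scans upsert rather than duplicate) could look like the sketch below. `session` is assumed to be a Neo4j driver session, or anything with a compatible `run(query, **params)` method; the helper name is invented.

```python
UPSERT_EDGE = """
MERGE (a:Entity {value: $src, type: $src_type})
MERGE (b:Entity {value: $dst, type: $dst_type})
MERGE (a)-[:REL {type: $rel}]->(b)
"""


def upsert_edge(session, src, src_type, rel, dst, dst_type):
    """Create both nodes and the relationship only if they do not exist yet."""
    session.run(
        UPSERT_EDGE,
        src=src, src_type=src_type,
        rel=rel,
        dst=dst, dst_type=dst_type,
    )
```

For example, `upsert_edge(session, "example.com", "domain", "resolves_to", "172.67.189.133", "ip")` would record a DNS resolution. Using three separate `MERGE` clauses avoids duplicating a node when only one side of the edge already exists.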
| Node Type | Example value | Description |
|---|---|---|
| `ip` | `172.67.189.133` | An IPv4 address |
| `domain` | `example.com` | A domain name from CT logs |
| `asn` | `origin: AS13335` | Autonomous System Number |
| `org` | `Amazon Technologies Inc.` | Hosting Provider / Organization |
| `tls` | `('TLS_AES_256_GCM_SHA384', ...)` | Negotiated TLS cipher suite |
| `favicon` | `277325061` | MurmurHash3 of favicon.ico |
| `http` | `cloudflare` | HTTP Server header value |
| Relationship | Meaning |
|---|---|
| `resolves_to` | Domain → IP (DNS) |
| `hosted_on` | IP → ASN (network operator) |
| `hosted_by` | IP → Hosting Provider (org) |
| `tls_cipher` | IP → TLS cipher suite |
| `favicon_hash` | IP → favicon mmh3 hash |
| `server` | IP → HTTP server software |
Every IP that gets fully scanned is added to a Redis set (processed). On subsequent runs:
- CT stream skips domains whose IPs are already in the set
- Workers skip IPs already in the set
This means you can freely stop and restart the stack without wasting time re-scanning the same hosts. The graph data in Neo4j is preserved, and only new, unseen targets get processed.
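The deduplication check reduces to two Redis set operations. In this sketch `store` is assumed to be a `redis.Redis`-like client, and the helper names are illustrative.

```python
PROCESSED_KEY = "processed"


def should_scan(store, ip):
    """True when the IP has not been fully scanned before."""
    return not store.sismember(PROCESSED_KEY, ip)


def mark_processed(store, ip):
    """Record the IP once every recon module has finished."""
    store.sadd(PROCESSED_KEY, ip)
```

Marking the IP only after the scan completes means an interrupted scan is retried on the next run instead of being skipped.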
To check whether previously scanned infrastructure has changed (e.g., an IP moved to a different ASN, swapped TLS ciphers, or changed its HTTP server), run:

    python main.py --rescan

This will:
- Pull every IP from the `processed` set
- Re-queue them all into Redis
- For each IP, query Neo4j for existing data before scanning
- Run all recon modules again
- Compare old vs new values and report any differences
- Auto-exit when the queue is drained
🔄 Re-scanning: 172.67.189.133
⚠ CHANGE [HTTP Server]
old: nginx
new: cloudflare
⚠ CHANGE [ASN]
old: origin: AS15169
new: origin: AS13335
⚡ 2 change(s) detected!
🔄 Re-scanning: 8.8.8.8
✓ No changes detected
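The compare step behind a report like the one above can be sketched as a dictionary diff. The function names and the plain-text layout are illustrative; the field labels come from the sample output.

```python
def diff_results(old, new):
    """Return [(field, old_value, new_value)] for every field that changed."""
    changes = []
    for field in sorted(set(old) | set(new)):
        if old.get(field) != new.get(field):
            changes.append((field, old.get(field), new.get(field)))
    return changes


def format_report(ip, changes):
    """Render the changes for one IP as plain text."""
    lines = [f"Re-scanning: {ip}"]
    for field, before, after in changes:
        lines += [f"  CHANGE [{field}]", f"    old: {before}", f"    new: {after}"]
    if changes:
        lines.append(f"  {len(changes)} change(s) detected!")
    else:
        lines.append("  No changes detected")
    return "\n".join(lines)
```

Here `old` would hold the values read from Neo4j before the scan and `new` the freshly collected ones, both keyed by field name.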
| Flag | Description |
|---|---|
| `--target <host>` | Scan a specific domain or IP and add to the graph |
| `--rescan` | Re-scan all known targets for changes |
| `--workers N` | Number of parallel worker processes (default: 3) |
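These flags map naturally onto `argparse`. The sketch below shows one way to declare them; `build_parser` is an illustrative name and the help strings are copied from the table.

```python
import argparse


def build_parser():
    """Declare the command-line flags listed in the table above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--target", metavar="<host>",
                        help="Scan a specific domain or IP and add to the graph")
    parser.add_argument("--rescan", action="store_true",
                        help="Re-scan all known targets for changes")
    parser.add_argument("--workers", type=int, default=3, metavar="N",
                        help="Number of parallel worker processes (default: 3)")
    return parser
```

For instance, `build_parser().parse_args(["--workers", "5"])` yields a namespace with `workers=5`, `rescan=False`, and `target=None`.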

    # 1. Start Redis + Neo4j
    docker compose up -d

    # 2. Setup Python environment
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt

    # 3. Run the full stack (CT stream + 3 workers)
    python main.py

    # 4. Open Neo4j UI
    open http://localhost:7474
    # Login: neo4j / password

    # 5. Later: check for infrastructure changes
    python main.py --rescan

| File | Purpose |
|---|---|
| `main.py` | Supervisor: spawns all processes; run this |
| `ct_stream.py` | Polls CT logs, discovers domains, feeds Redis |
| `worker.py` | Consumes Redis, runs 6 recon modules, writes to Neo4j |
| `pivots.py` | Pre-built Cypher queries for cluster analysis |
| `seed.py` | Manually inject IPs into the queue |
| `docker-compose.yml` | Redis + Neo4j containers |
| `requirements.txt` | Python dependencies |
