Threat Infrastructure Graphing Stack

Figure: Neo4j graph visualization of CT log-derived infrastructure

An internet-scale threat reconnaissance and infrastructure graphing pipeline. This stack discovers domains from public Certificate Transparency logs in near real time, resolves them, fingerprints the underlying infrastructure, and builds a continuously updated relationship graph in Neo4j for cluster analysis.

Architecture Overview

       ┌──────────────────────────────┐
       │  Google CT Log Servers       │
       │  (Argon 2026h1/h2, Xenon)    │
       │  RFC 6962 HTTP API           │
       └──────────────┬───────────────┘
                      │ HTTP polling (256 entries/batch)
                      │
              ┌───────▼───────┐
              │  ct_stream.py │  Parse X.509 certs → extract domains
              └───────┬───────┘
                      │ LPUSH domain names
                      │
              ┌───────▼───────┐
              │  Redis Queue  │  Key: "targets" (list)
              └───────┬───────┘
                      │ BRPOP (blocking pop)
          ┌───────────┼───────────┐
          │           │           │
   ┌──────▼──┐ ┌──────▼──┐ ┌──────▼──┐
   │Worker 1 │ │Worker 2 │ │Worker 3 │   (parallel multiprocessing)
   └────┬────┘ └────┬────┘ └────┬────┘
        │           │           │
        │  For each target:     │
        │  ├─ DNS Resolution    │
        │  ├─ WHOIS → ASN       │
        │  ├─ TLS Handshake     │
        │  ├─ Favicon mmh3 Hash │
        │  ├─ HTTP Headers      │
        │  └─ Nmap Top 20 Ports │
        │           │           │
        └───────────┼───────────┘
                    │ MERGE (upsert)
            ┌───────▼───────┐
            │    Neo4j      │  Graph database
            │  (bolt:7687)  │  UI at :7474
            └───────────────┘

How Each Component Works

1. `main.py` — The Supervisor

This is the only script you run. It uses Python's multiprocessing module to spawn:

- 1 CT Stream process: polls Certificate Transparency logs
- 3 Worker processes: consume the Redis queue in parallel

All processes are managed together. Ctrl+C triggers a clean shutdown of every child process.
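The supervisor pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `run_ct_stream`, `run_worker`, `build_processes`, and `supervise` are hypothetical names, and the two loop functions are placeholders for the real poller and worker.

```python
import multiprocessing as mp
import time


def run_ct_stream():
    """Placeholder for the CT poller loop (hypothetical name)."""
    while True:
        time.sleep(1)


def run_worker(worker_id):
    """Placeholder for the recon worker loop (hypothetical name)."""
    while True:
        time.sleep(1)


def build_processes(n_workers=3):
    """Create, but do not start, one CT stream process and n worker processes."""
    procs = [mp.Process(target=run_ct_stream, name="ct-stream", daemon=True)]
    for i in range(1, n_workers + 1):
        procs.append(
            mp.Process(target=run_worker, args=(i,), name=f"worker-{i}", daemon=True)
        )
    return procs


def supervise(procs):
    """Start every child, then wait; on Ctrl+C terminate and reap all of them."""
    for p in procs:
        p.start()
    try:
        for p in procs:
            p.join()
    except KeyboardInterrupt:
        for p in procs:
            p.terminate()
        for p in procs:
            p.join()
```

A script would call `supervise(build_processes())` under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start children with the spawn method.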

2. `ct_stream.py` — Certificate Transparency Poller

Instead of relying on third-party websocket services (like certstream, which is frequently down), we poll Google's CT log servers directly using the RFC 6962 HTTP API.

What are CT Logs?

Certificate Transparency is a set of public, append-only logs that record the TLS certificates issued by publicly trusted Certificate Authorities. Major browsers such as Chrome and Safari only trust a certificate that carries proof of logging (Signed Certificate Timestamps), so CAs such as Let's Encrypt and DigiCert submit effectively every certificate they issue.

How the poller works:

1. get-sth (Signed Tree Head): asks the log server "how many total entries exist?" and returns the tree_size.
2. Start position: we begin BACKLOG (500) entries before the tip, giving us an immediate batch of data to process.
3. get-entries: fetches up to 256 raw certificate entries per request.
4. Parse X.509: each entry contains a DER-encoded certificate. We decode it using the cryptography library and extract:
   - SAN (Subject Alternative Name): the list of all domains the cert covers
   - CN (Common Name): fallback if no SAN exists
5. Strip wildcards: *.example.com becomes example.com
6. Push to Redis: each domain is LPUSHed into the "targets" list for workers to consume.
7. Loop: once caught up, the poller checks for new entries every 0.5 seconds.
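The offline parts of these steps can be sketched with the standard library alone. The function names below are hypothetical; the leaf layout follows the `MerkleTreeLeaf` structure from RFC 6962, and the sketch only handles `x509_entry` leaves (precertificate entries are skipped, which may differ from what `ct_stream.py` does). Decoding the returned DER bytes to read SAN and CN would still need the cryptography library mentioned above.

```python
import base64
import struct

BACKLOG = 500     # entries to rewind from the tip (value from the text)
BATCH_SIZE = 256  # entries per get-entries request (value from the text)


def batch_ranges(tree_size, start=None):
    """Yield inclusive (start, end) pairs for get-entries, up to tree_size - 1."""
    if start is None:
        start = max(0, tree_size - BACKLOG)
    while start < tree_size:
        end = min(start + BATCH_SIZE, tree_size) - 1
        yield start, end
        start = end + 1


def leaf_certificate_der(leaf_input_b64):
    """Return the DER certificate of an x509_entry leaf, or None otherwise."""
    leaf = base64.b64decode(leaf_input_b64)
    # MerkleTreeLeaf header: version(1) leaf_type(1) timestamp(8) entry_type(2)
    _version, _leaf_type, _timestamp, entry_type = struct.unpack(">BBQH", leaf[:12])
    if entry_type != 0:  # 0 = x509_entry, 1 = precert_entry
        return None
    length = int.from_bytes(leaf[12:15], "big")  # 24-bit length prefix
    return leaf[15:15 + length]


def normalize_domain(name):
    """Lower-case a certificate name and strip a leading wildcard label."""
    name = name.strip().lower()
    return name[2:] if name.startswith("*.") else name
```

With a tree_size of 1000, `batch_ranges` yields `(500, 755)` and then `(756, 999)`, so the 500-entry backlog is covered in two requests.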

Failover

Three CT log endpoints are configured. If one fails, the poller automatically falls through to the next:

- Google Argon 2025h1 (US)
- Google Argon 2025h2 (US)
- Google Xenon 2025h1 (EU)
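The fall-through behaviour might look like the sketch below. The base URLs are assumptions about where these logs are served and should be checked against the current public log list; `http_get_sth` and `first_available` are hypothetical names.

```python
import json
import urllib.request

# Assumed base URLs for the three logs named above; verify before use.
CT_LOGS = [
    "https://ct.googleapis.com/logs/us1/argon2025h1/",
    "https://ct.googleapis.com/logs/us1/argon2025h2/",
    "https://ct.googleapis.com/logs/eu1/xenon2025h1/",
]


def http_get_sth(base_url, timeout=10):
    """Fetch the Signed Tree Head from one log via the RFC 6962 get-sth call."""
    with urllib.request.urlopen(base_url + "ct/v1/get-sth", timeout=timeout) as resp:
        return json.load(resp)


def first_available(logs, fetch=http_get_sth):
    """Try each log in order; return (base_url, sth) for the first that answers."""
    for base_url in logs:
        try:
            return base_url, fetch(base_url)
        except (OSError, ValueError) as exc:  # network errors and bad JSON
            print(f"[ct] {base_url} failed: {exc}")
    return None, None
```

Passing `fetch` as a parameter keeps the ordering logic testable without network access.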

3. `worker.py` — The Recon Engine

Each worker process runs an infinite loop, blocking on BRPOP until a target appears in Redis. When a target arrives, the worker runs 6 recon modules against it:

Module 1: DNS Resolution

Domain → IP address

If the target from the CT stream is a domain name (not an IP), the worker resolves it via socket.gethostbyname(), which returns a single IPv4 address. The (domain) -[resolves_to]→ (ip) relationship is written to Neo4j. If resolution fails, the target is silently dropped.
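A minimal sketch of this step, assuming a hypothetical helper named `resolve_target` that returns `None` for anything that should be dropped:

```python
import ipaddress
import socket


def resolve_target(target):
    """Return an IP string for target, or None if it cannot be resolved."""
    try:
        ipaddress.ip_address(target)
        return target  # already an IP literal, nothing to resolve
    except ValueError:
        pass
    try:
        return socket.gethostbyname(target)  # single IPv4 address
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```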

Module 2: ASN & Organization Lookup

IP → Autonomous System Number + Hosting Provider

Runs the system whois command against the IP and parses the output. It extracts:

- The ASN (origin field) identifying the network operator.
- The Organization / Hosting Provider (OrgName, Organization, netname, or descr field) identifying the specific host (e.g., Amazon Technologies Inc., Google LLC).

Graph edges:

- (ip) -[hosted_on]→ (asn)
- (ip) -[hosted_by]→ (org)
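The parsing step could be approximated as below. The regular expressions and the function names `parse_whois` and `whois_lookup` are assumptions; real whois output varies by registry, and this sketch simply takes the first matching line for each field.

```python
import re
import subprocess

ASN_RE = re.compile(r"^origin:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)
ORG_RE = re.compile(
    r"^(?:OrgName|Organization|netname|descr):[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_whois(text):
    """Return (asn, org) from whois output; either may be None if absent."""
    asn = ASN_RE.search(text)
    org = ORG_RE.search(text)
    return (
        asn.group(1) if asn else None,
        org.group(1).strip() if org else None,
    )


def whois_lookup(ip, timeout=15):
    """Run the system whois command (must be installed) and parse its output."""
    result = subprocess.run(
        ["whois", ip], capture_output=True, text=True, timeout=timeout
    )
    return parse_whois(result.stdout)
```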

Module 3: TLS Fingerprint

IP:443 → cipher suite + certificate issuer

Opens a TLS connection to port 443, completes the handshake, and extracts:

- The negotiated cipher suite (e.g., TLS_AES_256_GCM_SHA384)
- The certificate issuer

Threat actors often reuse the same TLS configuration across their infrastructure. Because a handful of cipher suites are negotiated by most modern servers, a shared cipher suite is most useful as a clustering signal when combined with other attributes such as the ASN.

Graph edge: (ip) -[tls_cipher]→ (cipher)
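A handshake of this kind can be sketched with Python's `ssl` module. The function name `tls_fingerprint` is hypothetical, and certificate verification is disabled on the assumption that the goal is fingerprinting rather than trust.

```python
import socket
import ssl


def tls_fingerprint(ip, port=443, timeout=5.0):
    """Return (cipher_tuple, der_cert) from a handshake, or None on failure."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # connecting by IP, so names cannot match
    ctx.verify_mode = ssl.CERT_NONE  # fingerprinting only, not validating trust
    try:
        with socket.create_connection((ip, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw) as tls:
                # cipher() returns (name, protocol_version, secret_bits)
                return tls.cipher(), tls.getpeercert(binary_form=True)
    except OSError:  # covers ssl.SSLError, refused connections, and timeouts
        return None
```

With verification disabled, `getpeercert()` only returns useful data in binary form, so the issuer would be read by parsing the DER bytes, for example with the cryptography library the poller already uses.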

Module 4: Favicon Hash

IP → mmh3 hash of /favicon.ico

Fetches http(s)://<ip>/favicon.ico and computes its MurmurHash3. This is the same technique used by Shodan for favicon-based infrastructure discovery. Identical favicon hashes across different IPs strongly suggest shared infrastructure or the same web application deployment.

Graph edge: (ip) -[favicon_hash]→ (hash)
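The hash computation could look like the sketch below. It uses a pure-Python MurmurHash3 (x86, 32-bit) as a stand-in for the `mmh3` package so that it runs without dependencies. It also assumes the Shodan convention of hashing the base64-encoded favicon bytes rather than the raw bytes; whether `worker.py` does the same is not stated here.

```python
import base64


def murmur3_32(data, seed=0):
    """MurmurHash3 x86 32-bit, returned as a signed int like mmh3.hash()."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    body_end = n - n % 4
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask
        h = (h * 5 + 0xE6546B64) & mask
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # Finalization mix
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h


def favicon_hash(content):
    """Hash favicon bytes Shodan-style: base64 with line breaks, then mmh3."""
    return murmur3_32(base64.encodebytes(content))
```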

Module 5: HTTP Fingerprint

IP → Server header value

Makes a simple GET / request and extracts the Server response header (e.g., nginx, cloudflare, Apache/2.4.41). This identifies the web server software.

Graph edge: (ip) -[server]→ (server_name)
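A minimal sketch of the header lookup over plain HTTP, using the standard library; `server_header` is a hypothetical name and the real module may use a different HTTP client.

```python
import http.client


def server_header(ip, port=80, timeout=5.0):
    """Return the Server response header for GET /, or None on any failure."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().getheader("Server")
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```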

Module 6: Nmap Port Scan

IP → open ports + service versions

Runs nmap -Pn -sV --top-ports 20 to scan the 20 most commonly used ports and identify the service and version on each one that is open. The output is parsed to keep only the open ports for clean display.
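The output filtering can be sketched as follows, assuming nmap's default human-readable output. The regular expression and the names `parse_open_ports` and `scan` are illustrative and not taken from `worker.py`.

```python
import re
import subprocess

# Matches lines such as "22/tcp   open  ssh  OpenSSH 8.9p1"
PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)\s*(.*)$")


def parse_open_ports(nmap_output):
    """Return (port, proto, service, version) for every port reported open."""
    ports = []
    for line in nmap_output.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            port, proto, service, version = match.groups()
            ports.append((int(port), proto, service, version.strip()))
    return ports


def scan(ip, timeout=120):
    """Run the nmap command from the text (nmap must be installed)."""
    cmd = ["nmap", "-Pn", "-sV", "--top-ports", "20", ip]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return parse_open_ports(result.stdout)
```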

Note

Nmap is the slowest module by far (service detection can take 60 seconds or more per host). This is why we run 3 workers in parallel.

4. `pivots.py` — Cluster Analysis Queries

Pre-built Cypher queries for finding infrastructure clusters:

- Shared Favicon Clusters: find IPs that serve the same favicon (likely same operator)
- Shared ASN + TLS: find IPs on the same network using the same TLS configuration (high confidence clustering)
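Queries of this kind might be written as below. They are a sketch built on the Entity/REL schema this README documents, not the actual contents of `pivots.py`; `run_pivot` is a hypothetical helper that expects a Neo4j driver session.

```python
# IPs grouped by the favicon hash they share
SHARED_FAVICON = """
MATCH (ip:Entity {type: 'ip'})-[:REL {type: 'favicon_hash'}]->(f:Entity)
WITH f, collect(DISTINCT ip.value) AS ips
WHERE size(ips) > 1
RETURN f.value AS favicon, ips
ORDER BY size(ips) DESC
"""

# IPs grouped by the (ASN, TLS cipher) pair they share
SHARED_ASN_TLS = """
MATCH (ip:Entity {type: 'ip'})-[:REL {type: 'hosted_on'}]->(asn:Entity),
      (ip)-[:REL {type: 'tls_cipher'}]->(tls:Entity)
WITH asn, tls, collect(DISTINCT ip.value) AS ips
WHERE size(ips) > 1
RETURN asn.value AS asn, tls.value AS tls, ips
ORDER BY size(ips) DESC
"""


def run_pivot(session, query):
    """Run a pivot query on a driver session and return rows as dicts."""
    return [record.data() for record in session.run(query)]
```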

5. `seed.py` — Manual Target Injection

Pushes hardcoded IPs directly into the Redis queue, bypassing the CT stream entirely. Useful for targeted investigations.

Neo4j Graph Schema

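Every node carries the same label (Entity) and is distinguished by its type property. Relationships between nodes likewise share one relationship type (REL) and are distinguished by a type property: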

(:Entity {value, type})  -[:REL {type}]→  (:Entity {value, type})

| Node Type | Example `value` | Description |
|---|---|---|
| `ip` | `172.67.189.133` | An IPv4 address |
| `domain` | `example.com` | A domain name from CT logs |
| `asn` | `origin: AS13335` | Autonomous System Number |
| `org` | `Amazon Technologies Inc.` | Hosting Provider / Organization |
| `tls` | `('TLS_AES_256_GCM_SHA384', ...)` | Negotiated TLS cipher suite |
| `favicon` | `277325061` | MurmurHash3 of favicon.ico |
| `http` | `cloudflare` | HTTP Server header value |

| Relationship | Meaning |
|---|---|
| `resolves_to` | Domain → IP (DNS) |
| `hosted_on` | IP → ASN (network operator) |
| `hosted_by` | IP → Hosting Provider (org) |
| `tls_cipher` | IP → TLS cipher suite |
| `favicon_hash` | IP → favicon mmh3 hash |
| `server` | IP → HTTP server software |
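Given this schema, the MERGE (upsert) step shown in the architecture diagram could be sketched like this. The query text and the helper names `edge_params` and `write_edge` are assumptions; `session` stands for a Neo4j driver session.

```python
UPSERT_EDGE = """
MERGE (a:Entity {value: $src, type: $src_type})
MERGE (b:Entity {value: $dst, type: $dst_type})
MERGE (a)-[:REL {type: $rel}]->(b)
"""


def edge_params(src, src_type, rel, dst, dst_type):
    """Build query parameters; values are stringified before storage."""
    return {
        "src": str(src),
        "src_type": src_type,
        "rel": rel,
        "dst": str(dst),
        "dst_type": dst_type,
    }


def write_edge(session, src, src_type, rel, dst, dst_type):
    """Upsert both nodes and the relationship between them."""
    session.run(UPSERT_EDGE, edge_params(src, src_type, rel, dst, dst_type))
```

Because every clause is a MERGE, writing the same edge twice leaves a single pair of nodes and a single relationship, which is what lets the graph update continuously without duplicates.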

Deduplication & Rescan Mode

Deduplication (default)

Every IP that gets fully scanned is added to a Redis set (processed). On subsequent runs:

- CT stream skips domains whose IPs are already in the set
- Workers skip IPs already in the set

This means you can freely stop and restart the stack without wasting time re-scanning the same hosts. The graph data in Neo4j is preserved, and only new, unseen targets get processed.
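The worker-side check can be sketched with two Redis set commands. `scan_once` is a hypothetical helper; `r` is expected to be a redis-py client (or anything exposing `sismember` and `sadd`), and `scan` is whatever callable runs the recon modules.

```python
PROCESSED_KEY = "processed"  # Redis set named in the text


def scan_once(r, ip, scan, key=PROCESSED_KEY):
    """Scan ip unless it is already in the processed set.

    Returns True if a scan ran, False if the IP was skipped.
    """
    if r.sismember(key, ip):
        return False
    scan(ip)
    # Mark as processed only after the scan finished, so an interrupted
    # scan is retried on the next run.
    r.sadd(key, ip)
    return True
```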

Rescan Mode (`--rescan`)

To check whether previously scanned infrastructure has changed (e.g., an IP moved to a different ASN, swapped TLS ciphers, or changed its HTTP server), run:

python main.py --rescan

This will:

1. Pull every IP from the processed set
2. Re-queue them all into Redis
3. For each IP, query Neo4j for existing data before scanning
4. Run all recon modules again
5. Compare old vs new values and report any differences
6. Auto-exit when the queue is drained

Example output

🔄 Re-scanning: 172.67.189.133
   ⚠ CHANGE [HTTP Server]
      old: nginx
      new: cloudflare
   ⚠ CHANGE [ASN]
      old: origin: AS15169
      new: origin: AS13335
   ⚡ 2 change(s) detected!

🔄 Re-scanning: 8.8.8.8
   ✓ No changes detected
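The comparison behind a report like this can be sketched as a dictionary diff. The field names and the functions `diff_results` and `format_report` are illustrative; the real module may store and format results differently.

```python
def diff_results(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    changes = {}
    for field in sorted(set(old) | set(new)):
        if old.get(field) != new.get(field):
            changes[field] = (old.get(field), new.get(field))
    return changes


def format_report(ip, changes):
    """Render a plain-text report in the spirit of the example output."""
    lines = [f"Re-scanning: {ip}"]
    for field, (old, new) in changes.items():
        lines += [f"   CHANGE [{field}]", f"      old: {old}", f"      new: {new}"]
    if changes:
        lines.append(f"   {len(changes)} change(s) detected!")
    else:
        lines.append("   No changes detected")
    return "\n".join(lines)
```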

CLI Flags

| Flag | Description |
|---|---|
| `--target <host>` | Scan a specific domain or IP and add to the graph |
| `--rescan` | Re-scan all known targets for changes |
| `--workers N` | Number of parallel worker processes (default: 3) |

Quick Start

# 1. Start Redis + Neo4j
docker compose up -d

# 2. Setup Python environment
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# 3. Run the full stack (CT stream + 3 workers)
python main.py

# 4. Open Neo4j UI
open http://localhost:7474
# Login: neo4j / password

# 5. Later — check for infrastructure changes
python main.py --rescan

File Map

| File | Purpose |
|---|---|
| `main.py` | Supervisor: spawns all processes; run this |
| `ct_stream.py` | Polls CT logs, discovers domains, feeds Redis |
| `worker.py` | Consumes Redis, runs 6 recon modules, writes to Neo4j |
| `pivots.py` | Pre-built Cypher queries for cluster analysis |
| `seed.py` | Manually inject IPs into the queue |
| `docker-compose.yml` | Redis + Neo4j containers |
| `requirements.txt` | Python dependencies |
