Build a comprehensive, high-precision dossier on a person — usually yourself — by recursive, pagerank-style deep-web research. Orchestrated from Claude Code; thin Python scripts do the per-source work.
What you give: a name and a few grounding facts (schools, employers, handles).
What you get: an Obsidian-wikilinked catalog of everything public about you — papers, talks, podcasts, articles, blog posts, social handles, Wayback versions of old sites — with provenance hashes back to source artifacts.
Why bother: personal-brand audit, CV / About-page prep, finding things you forgot about, catching same-name impersonators, seeing your full public footprint at a glance before someone else does.
git clone https://github.com/sc932/content_finder
cd content_finder
claude

That's it. Claude opens by asking who you're researching and your relationship to them, then walks you through seeding the dossier conversationally: no manual file editing, no remembering commands. Discrete choices come as taps; open questions come as inline text.
A redacted worked example — the maintainer's own dossier, the result of dozens of recursive passes — ships in this repo as the canonical demo.
Three screenshots / a short GIF will go here once captured: (1) the opening interview, (2) sweeps + parse-reviews running, (3) scrolling through the final dossier/*.md catalog. The text below approximates the first of those.
$ claude
You: hi
Claude: Hi! I'm content_finder. I help build a comprehensive,
high-precision dossier on a person — articles, papers, talks,
podcasts, social handles, and Wayback versions of anything
they've published online.
Two quick questions to scope what we're doing:
1. Who are we building a dossier on? (Just type the name.)
And — [AskUserQuestion] — 2. What's your relationship to them?
[ ] Myself (Recommended)
[ ] Public figure (journalism, due diligence, biography)
[ ] Family or associate (with explicit consent)
[ ] Other (will need SCOPE.md)
You: Jane Doe
Myself
Claude: Got it — your self-dossier. Let's seed the fingerprint with a
few grounding facts. Where did you go to college, and what did
you study?
You: UC Berkeley, BS Math, 2008-2012
Claude: Added Berkeley + BS Math 2008–2012 to fingerprint. Any graduate
work or further degrees?
...
Claude: Fingerprint seeded — 4 universities, 3 employers, 6 social
handles, 2 personal domains.
[AskUserQuestion] Ready to run the first sweep?
[ ] All default sweeps (Recommended; ~10 min)
[ ] Just scholarly (OpenAlex + Semantic Scholar; fastest)
[ ] Just web presence (Wayback + GitHub + handle-probe)
[ ] I'll choose specifically
From there, Claude runs the chosen sweeps, parse-reviews each ambiguous hit with you ("is this paper yours? this podcast guest spot? this same-name BMX racer?"), catalogs findings into dossier/*.md with citations, and proposes the next concrete step at the end of every turn. Stop whenever; come back later; run claude again — it picks up from the log + queues + fingerprint exactly where it left off.
Pass-based, checkpointed sessions. Each claude invocation is one pass. State lives on disk — frontier queue, fingerprint, findings, dossier. Nothing depends on session memory. You can stop and resume across days or weeks.
Three kinds of state:
- Subject identity: subject/fingerprint.yaml holds structured facts that grow over time and serve as the disambiguation fingerprint scored against every candidate hit.
- Process state: frontier/queue.md (priority-queued leads), tasks/handoffs.md (things needing the human), tasks/interviews.md (open questions for the human).
- Findings: findings/raw/ (content-addressed blobs of fetched HTML/PDF/audio), findings/by-source/*.md (per-source working notes), dossier/*.md (synthesized catalog with citations back to artifact hashes).
Recursive frontier with decaying scores. Discovered URLs and entities go on the frontier with a score and a depth. Each pass pops the top-scored items, fetches them, scores newly-discovered links, decays older scores, and prunes the tail. Pagerank-flavored — the more independent paths to a thing, the higher its score.
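A minimal sketch of one frontier pass, to make the pop / score / decay / prune cycle concrete. The `Lead` class, the `run_pass` function, and the `budget`, `decay`, `floor`, and 0.5 inheritance factor are all illustrative assumptions, not the tool's actual names or values.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    url: str
    score: float
    depth: int


def run_pass(frontier, discover, budget=2, decay=0.8, floor=0.05):
    """One pass: pop the top leads, score what they link to, decay the rest, prune.

    `frontier` maps url -> Lead; `discover(url)` returns the URLs found at `url`.
    All parameter values here are hypothetical defaults.
    """
    ranked = sorted(frontier.values(), key=lambda lead: lead.score, reverse=True)
    popped, kept = ranked[:budget], ranked[budget:]

    # Leads that were not visited this pass lose some score.
    for lead in kept:
        lead.score *= decay
    next_frontier = {lead.url: lead for lead in kept}

    visited = {lead.url for lead in popped}
    for lead in popped:
        for url in discover(lead.url):
            if url in visited:
                continue
            gain = lead.score * 0.5  # assumed share of the parent's score
            if url in next_frontier:
                # Another independent path to the same lead raises its score.
                next_frontier[url].score += gain
            else:
                next_frontier[url] = Lead(url, gain, lead.depth + 1)

    # Prune the low-scoring tail.
    pruned = {u: l for u, l in next_frontier.items() if l.score >= floor}
    return popped, pruned
```

Because a lead reached from two popped items accumulates both gains, it outranks a lead reached from only one, which is the pagerank-flavored behavior described above.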
Disambiguation by fingerprint. Every hit gets scored against fingerprint.yaml (affiliations, date windows, co-authors, topics, geography). Strong match = auto-accept. Ambiguous = filed as an interview question. Bad match = rejected with a logged reason. Critical for common names.
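The three-way triage can be sketched as a weighted overlap score with two thresholds. The field names, weights, date-window penalty, and thresholds below are assumptions chosen for illustration; the real rules live in the tool's own scripts.

```python
# Hypothetical weights per fingerprint field.
WEIGHTS = {"affiliations": 0.4, "coauthors": 0.3, "topics": 0.2, "geography": 0.1}


def score_hit(hit, fingerprint):
    """Sum the weights of fields where the hit overlaps the fingerprint."""
    score = sum(
        weight
        for field, weight in WEIGHTS.items()
        if set(hit.get(field, ())) & set(fingerprint.get(field, ()))
    )
    # Penalize hits dated outside the subject's known active window.
    start, end = fingerprint["active_years"]
    year = hit.get("year")
    if year is not None and not (start <= year <= end):
        score -= 0.5
    return score


def triage(hit, fingerprint, accept=0.6, reject=0.2):
    """Return 'accept', 'interview', or 'reject' for a candidate hit."""
    score = score_hit(hit, fingerprint)
    if score >= accept:
        return "accept"
    if score < reject:
        return "reject"
    return "interview"  # ambiguous: ask the human


fingerprint = {
    "affiliations": ["UC Berkeley"],
    "coauthors": ["A. Smith"],
    "topics": ["math"],
    "geography": ["CA"],
    "active_years": (2008, 2025),
}
paper = {"affiliations": ["UC Berkeley"], "coauthors": ["A. Smith"], "year": 2015}
bmx_result = {"topics": ["bmx"], "year": 1990}
```

Here `triage(paper, fingerprint)` lands in the accept band, while `triage(bmx_result, fingerprint)` is rejected for having no overlap and falling outside the date window.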
Provenance required. Every dossier claim cites a hashed, timestamped artifact + Wayback fallback URL. Links rot; hashes don't.
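One way to picture a provenance record is a content-addressed write: the blob is stored under its SHA-256 digest, with a small sidecar holding the timestamp and fallback URL. The `store_artifact` function, the sidecar layout, and the record keys are hypothetical; the Wayback URL uses the generic `web.archive.org/web/<url>` form as an assumed fallback, not necessarily what the tool records.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def store_artifact(raw_dir, url, body):
    """Write `body` under its SHA-256 digest and return a provenance record."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(body).hexdigest()
    (raw_dir / digest).write_bytes(body)  # identical content -> identical path

    record = {
        "sha256": digest,
        "source_url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        # Assumed fallback form; a pinned snapshot URL would be more robust.
        "wayback_url": f"https://web.archive.org/web/{url}",
    }
    (raw_dir / f"{digest}.json").write_text(json.dumps(record, indent=2))
    return record


# Usage: store a fetched page, then read it back by hash.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "findings" / "raw"
    record = store_artifact(raw, "https://example.com/talk", b"hello")
    stored = (raw / record["sha256"]).read_bytes()
```

A dossier claim can then cite `record["sha256"]`; if the live URL rots, the local blob and the archive URL still back the claim.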
Handoffs and interviews are first-class. When the tool can't proceed (paywall, CAPTCHA, ambiguous attribution), it doesn't guess and doesn't silently skip — it files a task and raises it conversationally on the next pass. You answer; it ingests; it moves on.
Self-default; non-self requires SCOPE.md. No friction running on yourself. Running on someone else requires a committed SCOPE.md documenting consent / legitimate-interest justification / public-figure framing — the tool refuses without one.
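The refusal can be thought of as a pre-flight check along these lines. This is a sketch under assumptions: the `preflight` name and the relationship labels are hypothetical, and it only checks that a non-empty SCOPE.md exists, not that it is committed.

```python
import tempfile
from pathlib import Path


def preflight(repo_root, relationship):
    """Raise PermissionError for a non-self subject without a filled-in SCOPE.md."""
    if relationship == "self":
        return  # self-research needs no scope file
    scope = Path(repo_root) / "SCOPE.md"
    if not scope.is_file() or not scope.read_text().strip():
        raise PermissionError("non-self subject requires a filled-in SCOPE.md")


# Usage: self passes, a non-self subject is refused until SCOPE.md exists.
with tempfile.TemporaryDirectory() as tmp:
    preflight(tmp, "self")
    try:
        preflight(tmp, "public-figure")
        refused = False
    except PermissionError:
        refused = True
    (Path(tmp) / "SCOPE.md").write_text("Public figure; sources limited to published work.")
    preflight(tmp, "public-figure")
    allowed_with_scope = True
```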
Your data stays local. All working dirs are gitignored. Nothing of yours pushes to the public repo. The committed example_dossier/ is the only dossier-shaped content in the tree.
Requirements:
- Python 3.11+
- Claude Code CLI installed and configured
- Optional: paid-API keys in .env (see .env.example). Every paid path is gated behind an env var; free-tier sweeps run without any.
git clone https://github.com/sc932/content_finder
cd content_finder
cp .env.example .env # leave blank any keys you don't have
pip install -r requirements.txt # only needed if you'll run transcription

Most scripts are stdlib-only. The exceptions:
- scripts/transcribe needs faster-whisper (in requirements.txt). It pulls a model on first run and can use CPU or GPU.
- scripts/youtube-search and scripts/acquire-audio shell out to yt-dlp; install it via pip install yt-dlp or a system package.
- acquire-audio also requires ffmpeg (e.g., apt install ffmpeg) for audio extraction.
The cataloging-, fingerprint-, and dossier-only flows don't need any of the above — install on demand as you reach those workflows.
The Quick start above covers the happy path. A few longer-running flows worth flagging:
Subsequent sessions. Run claude again from the same directory. It loads the existing subject/fingerprint.yaml, surfaces any pending handoffs / interviews via AskUserQuestion, and proposes the next pass.
Dropping data exports. During the seed interview Claude offers to ingest a resume / CV PDF, LinkedIn GDPR export, Twitter/X archive, Google Takeout, or GitHub data export. When you opt in, Claude gives you a precise drop path (e.g., "drop your resume at inputs/resume.pdf, then say 'ready'"); the file stays gitignored on your local disk.
Exporting the dossier.
./scripts/export --target ../my_dossier

Emits a self-contained folder with the catalog markdown (Obsidian-wikilinked), the structured fingerprint, and a _MAP.md index written for both humans and downstream LLM agents. See example_dossier/ for what the output looks like.
example_dossier/ is a redacted snapshot of the maintainer's own dossier — the output of running scripts/export against the maintainer's full-fidelity local working state, then applying a public-share PII strip. It demonstrates what dozens of recursive passes can produce, and gives the maintainer a real indexable web-footprint.
Stripped before publishing: birthplace, hometown, secondary schooling, formative-teacher names, parents' details, personal email addresses, pseudonymous personal/forum handles. Kept: birth year, public co-authors / academic mentors, employment, papers, talks, awards, public profile handles (GitHub, LinkedIn, Twitter, Facebook). The full unredacted record lives only in the maintainer's local working state.
Internal IDs explained. References like HANDOFF-0001, FRONT-0010, SC-V-0017, INTEL-V-0002 you'll see throughout the dossier are tracking IDs from the live working state — handoff queue items, frontier items, source citations in findings/by-source/*.md. The files they point to are gitignored and don't ship in the public repo. They're audit-trail breadcrumbs showing the depth of the tool's tracking; from a public reader's perspective they're dead pointers. Your own dossier accumulates the same kind of IDs in your own gitignored working state.
Replacing the example with your own. Run scripts/export --target ./example_dossier --force against your working state, apply whatever redaction policy you want, and commit. The committed example is overwritten on each export.
content_finder/
├── README.md — this file
├── AGENTS.md — operating manual (read by Claude every session)
├── MAINTAINING.md — dev notes (build status, backlog, contributing)
├── CLAUDE.md — pointer to AGENTS.md
├── BACKLOG.md
├── LICENSE — MIT
├── SCOPE.md.example — template for non-self subject pre-flight (copy to SCOPE.md and fill in)
├── requirements.txt — Python deps (faster-whisper for transcription)
├── .env.example — paid-API key template
├── scripts/ — Python executables (each with --help and stdin/stdout JSON I/O)
├── workflows/ — multi-script pipeline docs (one per pipeline)
├── agents/ — Claude Code subagent definitions for per-source sweeps
└── example_dossier/ — committed redacted worked example
When a pass runs, the tool creates and writes to these gitignored working dirs:
├── subject/ — fingerprint.yaml, the canonical identity
├── frontier/ — priority-queued leads
├── tasks/ — handoffs.md, interviews.md (Claude-managed state)
├── findings/ — raw/ blobs + by-source/ notes
├── dossier/ — work-in-progress catalog
├── coverage/ — leak-detection probes + counters
├── log/ — append-only pass history
└── inputs/ — drop GDPR exports, resumes, hints.md here
A pass runs a budgeted subset of these — not all every time:
| Workflow | Trigger | What it does |
|---|---|---|
| Seed | First run, or new files in inputs/ | Interview the human + parse any dropped resume / GDPR / Takeout files into fingerprint.yaml + initial frontier |
| Sweep | Every pass | Per-source subagents — OpenAlex, Semantic Scholar, Wayback CDX, podcast indices, YouTube, GitHub, Sherlock + WhatsMyName |
| Frontier | Every pass | Pop top-scored items, fetch, score discovered links/entities, decay older scores |
| Detective | User drops a fuzzy-recall hint into inputs/hints.md | Targeted hunt: dorking, Wayback, news archives, paywalled-handoff queueing |
| Disambiguate | Hit score against fingerprint is ambiguous | Rule-based scoring; below threshold → filed as interview question |
| Acquire | Item passes threshold | Download via yt-dlp / gallery-dl / waybackpack / requests; SHA-256; store provenance |
| Transcribe | Audio/video acquired | faster-whisper; store segments + full transcript |
| Catalog | Enough new verified artifacts | Update dossier/*.md with citations back to artifact hashes |
| Coverage | Periodic | Negative probes, capture-recapture estimation, confidence reporting |
| Export | User says "export" | Emit self-contained dossier folder via scripts/export |
Each non-trivial workflow has its own doc in workflows/.
The tool distinguishes precision (each entry is correct) from recall (the list is exhaustive). Every list-field in fingerprint.yaml carries a _coverage.<field>.status tag — comprehensive-as-of:<date> (confirmed exhaustive on that date) or known-partial (more entries may exist that haven't been surfaced).
Don't infer completeness from a known-partial list. The absence of evidence ≠ evidence of absence; the tool surfaces this distinction explicitly so downstream consumers (you, future Claude sessions, agents reading the export) don't accidentally treat partial inventories as complete.
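A downstream consumer might read the coverage tag like this before treating a list as complete. The nested `_coverage` dictionary shape, the ISO date format, and the choice to treat an untagged field as known-partial are assumptions made for this sketch; only the two status values come from the text above.

```python
from datetime import date

PREFIX = "comprehensive-as-of:"


def coverage_status(fingerprint, field):
    """Return (is_comprehensive, as_of_date) for a list field.

    A field with no tag is treated as known-partial (an assumption of this sketch).
    """
    status = fingerprint.get("_coverage", {}).get(field, {}).get("status", "known-partial")
    if status.startswith(PREFIX):
        # Assumes the date is written in ISO format, e.g. 2025-01-15.
        return True, date.fromisoformat(status[len(PREFIX):])
    return False, None


fingerprint = {
    "employers": ["Acme"],
    "podcasts": ["Example Show ep. 12"],
    "_coverage": {
        "employers": {"status": "comprehensive-as-of:2025-01-15"},
        "podcasts": {"status": "known-partial"},
    },
}
```

With this data, `coverage_status(fingerprint, "employers")` reports the list as comprehensive as of 2025-01-15, while the podcast list is flagged partial, so an empty search for another podcast appearance proves nothing.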
Negative probes in coverage/probes.md seed known-existing artifacts; if a sweep doesn't re-surface a probe, the pipeline is leaking and that's investigated before trusting more findings.
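The probe check, plus a rough recall estimate, can be sketched as follows. The function names are hypothetical, and the Lincoln-Petersen formula is a standard capture-recapture estimator swapped in for illustration; this document does not say which estimator the Coverage workflow uses.

```python
def leaked_probes(probes, surfaced):
    """Return known-existing artifacts that a sweep failed to re-surface."""
    return sorted(set(probes) - set(surfaced))


def lincoln_petersen(sweep_a, sweep_b):
    """Estimate total artifact count from two independent sweeps.

    N ~= |A| * |B| / |A & B|. Returns None when the sweeps share nothing,
    since the estimate is undefined in that case.
    """
    a, b = set(sweep_a), set(sweep_b)
    overlap = len(a & b)
    if overlap == 0:
        return None
    return len(a) * len(b) / overlap


probes = ["paper-1", "talk-1"]
scholarly = ["paper-1", "paper-2", "paper-3", "paper-4"]
web = ["paper-3", "paper-4", "talk-2"]

missing = leaked_probes(probes, scholarly + web)   # ['talk-1']
estimate = lincoln_petersen(scholarly, web)        # 4 * 3 / 2 = 6.0
```

A non-empty `missing` list means the pipeline is leaking and should be investigated first; the estimate of 6 against 5 distinct artifacts found suggests roughly one item remains unsurfaced, under the strong assumption that the two sweeps are independent.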
- Self-subject default. No restrictions beyond site ToS.
- Other subjects require SCOPE.md. The tool refuses to run without one; it must document consent, a legitimate-interest justification, or public-figure framing. Copy SCOPE.md.example to SCOPE.md and fill it in before starting.
- Never automates logins to auth-walled sites. Cookies and sessions are not shared with the tool; auth-walled content (e.g., LinkedIn post feed, paywalled articles) goes through the handoff queue, so your accounts are never at risk of platform bans.
- No doxing of third parties. Home addresses, phone numbers, family info are never surfaced for non-consenting individuals — even when findable.
- Even self-research leaks intent patterns. Consider a separate browser profile / VPN if you care about that.
Pre-v0.1. The maintainer actively develops the tool against their own dossier: workflows are documented, scripts are runnable, the export feature works, and cataloging produces real output. The seed / cleanup / contributor experience for other subjects is still unpolished; expect rough edges if you're the second person to run this.
See MAINTAINING.md for the staged rollout plan (v0.1 → v2) and current backlog.