sc932/content_finder

content_finder

Build a comprehensive, high-precision dossier on a person — usually yourself — by recursive, pagerank-style deep-web research. Orchestrated from Claude Code; thin Python scripts do the per-source work.

What you give: a name and a few grounding facts (schools, employers, handles).

What you get: an Obsidian-wikilinked catalog of everything public about you — papers, talks, podcasts, articles, blog posts, social handles, Wayback versions of old sites — with provenance hashes back to source artifacts.

Why bother: personal-brand audit, CV / About-page prep, finding things you forgot about, catching same-name impersonators, seeing your full public footprint at a glance before someone else does.

Quick start

git clone https://github.com/sc932/content_finder
cd content_finder
claude

That's it. Claude opens by asking who you're researching and your relationship to them, then walks you through seeding the dossier conversationally — no manual file editing, no remembering commands. Discrete choices come as taps; open questions come as inline text.

A redacted worked example — the maintainer's own dossier, the result of dozens of recursive passes — ships in this repo as the canonical demo.

What a fresh session looks like

Three screenshots / a short GIF will go here once captured: (1) the opening interview, (2) sweeps + parse-reviews running, (3) scrolling through the final dossier/*.md catalog. The text below approximates the first of those.

$ claude

You:    hi

Claude: Hi! I'm content_finder. I help build a comprehensive,
        high-precision dossier on a person — articles, papers, talks,
        podcasts, social handles, and Wayback versions of anything
        they've published online.

        Two quick questions to scope what we're doing:

        1. Who are we building a dossier on? (Just type the name.)

        And — [AskUserQuestion] — 2. What's your relationship to them?
            [ ] Myself                (Recommended)
            [ ] Public figure         (journalism, due diligence, biography)
            [ ] Family or associate   (with explicit consent)
            [ ] Other                 (will need SCOPE.md)

You:    Jane Doe
        Myself

Claude: Got it — your self-dossier. Let's seed the fingerprint with a
        few grounding facts. Where did you go to college, and what did
        you study?

You:    UC Berkeley, BS Math, 2008-2012

Claude: Added Berkeley + BS Math 2008–2012 to fingerprint. Any graduate
        work or further degrees?

...

Claude: Fingerprint seeded — 4 universities, 3 employers, 6 social
        handles, 2 personal domains.

        [AskUserQuestion] Ready to run the first sweep?
            [ ] All default sweeps   (Recommended; ~10 min)
            [ ] Just scholarly       (OpenAlex + Semantic Scholar; fastest)
            [ ] Just web presence    (Wayback + GitHub + handle-probe)
            [ ] I'll choose specifically

From there, Claude runs the chosen sweeps, parse-reviews each ambiguous hit with you ("is this paper yours? this podcast guest spot? this same-name BMX racer?"), catalogs findings into dossier/*.md with citations, and proposes the next concrete step at the end of every turn. Stop whenever; come back later; run claude again — it picks up from the log + queues + fingerprint exactly where it left off.

How it works

Pass-based, checkpointed sessions. Each claude invocation is one pass. State lives on disk — frontier queue, fingerprint, findings, dossier. Nothing depends on session memory. You can stop and resume across days or weeks.

Three kinds of state:

  • Subject identity: subject/fingerprint.yaml, structured facts that grow over time and serve as the disambiguation fingerprint scored against every candidate hit.
  • Process state: frontier/queue.md (priority-queued leads), tasks/handoffs.md (things needing the human), tasks/interviews.md (open questions for the human).
  • Findings: findings/raw/ (content-addressed blobs of fetched HTML/PDF/audio), findings/by-source/*.md (per-source working notes), dossier/*.md (synthesized catalog with citations back to artifact hashes).

Recursive frontier with decaying scores. Discovered URLs and entities go on the frontier with a score and a depth. Each pass pops the top-scored items, fetches them, scores newly-discovered links, decays older scores, and prunes the tail. Pagerank-flavored — the more independent paths to a thing, the higher its score.
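
The pop / score / decay / prune cycle above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the `Frontier` class, its method names, the decay factor, and the size cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrontierItem:
    url: str
    score: float
    depth: int


class Frontier:
    """Hypothetical in-memory frontier; the real tool keeps its queue on disk."""

    def __init__(self, decay=0.9, max_size=1000):
        self.items = {}
        self.decay = decay
        self.max_size = max_size

    def add(self, url, score, depth):
        # Reaching the same URL by another path adds to its score,
        # which is the "more independent paths, higher score" idea.
        if url in self.items:
            item = self.items[url]
            item.score += score
            item.depth = min(item.depth, depth)
        else:
            self.items[url] = FrontierItem(url, score, depth)

    def pop_top(self, n):
        # Remove and return the n highest-scored items.
        ranked = sorted(self.items.values(), key=lambda i: i.score, reverse=True)
        top = ranked[:n]
        for item in top:
            del self.items[item.url]
        return top

    def end_pass(self):
        # Decay everything still waiting, then prune the low-scoring tail.
        for item in self.items.values():
            item.score *= self.decay
        ranked = sorted(self.items.values(), key=lambda i: i.score, reverse=True)
        self.items = {i.url: i for i in ranked[: self.max_size]}


frontier = Frontier(decay=0.5, max_size=1)
frontier.add("https://example.org/a", 1.0, depth=1)
frontier.add("https://example.org/b", 0.6, depth=1)
frontier.add("https://example.org/b", 0.6, depth=2)  # second path to b
frontier.add("https://example.org/c", 0.1, depth=3)
top = frontier.pop_top(1)  # b wins because two paths led to it
frontier.end_pass()        # a is decayed, c is pruned
```

Summing scores across paths is only one way to reward independent discovery; a real PageRank computation would iterate over the full link graph instead.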

Disambiguation by fingerprint. Every hit gets scored against fingerprint.yaml (affiliations, date windows, co-authors, topics, geography). Strong match = auto-accept. Ambiguous = filed as an interview question. Bad match = rejected with a logged reason. Critical for common names.
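
A rule-based scorer of this kind might look like the following sketch. The field names (`affiliations`, `coauthors`, `topics`, `active_years`), the integer weights, and both thresholds are assumptions chosen for illustration; the README does not specify the real schema or cutoffs.

```python
# Hypothetical weights: an overlap on each fingerprint field adds points.
WEIGHTS = {"affiliations": 4, "coauthors": 3, "topics": 2, "year": 1}


def score_hit(hit, fingerprint):
    """Score a candidate hit against the fingerprint (higher = better match)."""
    score = 0
    for key in ("affiliations", "coauthors", "topics"):
        if set(hit.get(key, [])) & set(fingerprint.get(key, [])):
            score += WEIGHTS[key]
    first, last = fingerprint["active_years"]
    year = hit.get("year")
    if year is not None and first <= year <= last:
        score += WEIGHTS["year"]
    return score


def triage(score, accept=7, reject=3):
    """Map a score to one of the three outcomes described above."""
    if score >= accept:
        return "accept"
    if score < reject:
        return "reject"
    return "interview"


fingerprint = {
    "affiliations": ["UC Berkeley"],
    "coauthors": ["A. Example"],
    "topics": ["optimization"],
    "active_years": (2008, 2024),
}
paper = {"affiliations": ["UC Berkeley"], "coauthors": ["A. Example"], "year": 2011}
podcast = {"topics": ["optimization"], "year": 2015}
bmx_result = {"topics": ["bmx racing"], "year": 1995}

outcomes = [triage(score_hit(h, fingerprint)) for h in (paper, podcast, bmx_result)]
```

Integer weights keep the threshold comparisons exact; a rejected hit would additionally be logged with its reason, which this sketch omits.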

Provenance required. Every dossier claim cites a hashed, timestamped artifact + Wayback fallback URL. Links rot; hashes don't.
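
As a rough sketch of what a hashed, timestamped artifact record could look like, assuming a content-addressed layout under findings/raw/: the `store_artifact` function, the two-character shard directory, and the record keys are hypothetical. The Wayback URL uses the generic `https://web.archive.org/web/<url>` form, which resolves to a capture only if one exists.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def store_artifact(content: bytes, source_url: str, root: Path) -> dict:
    """Write content under its SHA-256 digest and return a provenance record."""
    digest = hashlib.sha256(content).hexdigest()
    blob_path = root / digest[:2] / digest
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    if not blob_path.exists():  # identical content is stored only once
        blob_path.write_bytes(content)
    return {
        "sha256": digest,
        "source_url": source_url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "wayback_url": f"https://web.archive.org/web/{source_url}",
    }


with tempfile.TemporaryDirectory() as tmp:
    body = b"<html>hello</html>"
    record = store_artifact(body, "https://example.org/post", Path(tmp))
    stored = Path(tmp) / record["sha256"][:2] / record["sha256"]
    stored_ok = stored.read_bytes() == body
```

Because the path is derived from the digest, a dossier citation that carries the hash can always be checked against the blob, even after the original link has rotted.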

Handoffs and interviews are first-class. When the tool can't proceed (paywall, CAPTCHA, ambiguous attribution), it doesn't guess and doesn't silently skip — it files a task and raises it conversationally on the next pass. You answer; it ingests; it moves on.

Self-default; non-self requires SCOPE.md. No friction running on yourself. Running on someone else requires a committed SCOPE.md documenting consent / legitimate-interest justification / public-figure framing — the tool refuses without one.
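
The refusal can be pictured as a small pre-flight gate. The sketch below is an assumption about how such a check might be written; the `preflight` function and the `"myself"` label are hypothetical, and it only checks that SCOPE.md exists and is non-empty, not that it has been committed.

```python
import tempfile
from pathlib import Path


def preflight(relationship: str, repo_root: Path) -> str:
    """Allow self-subjects freely; demand a filled-in SCOPE.md for anyone else."""
    if relationship == "myself":
        return "ok: self-subject"
    scope = Path(repo_root) / "SCOPE.md"
    if not scope.is_file() or not scope.read_text(encoding="utf-8").strip():
        raise PermissionError("non-self subject requires a filled-in SCOPE.md")
    return "ok: SCOPE.md present"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    self_result = preflight("myself", root)
    try:
        preflight("public figure", root)
        refused = False
    except PermissionError:
        refused = True
    (root / "SCOPE.md").write_text("Public-figure framing: ...\n", encoding="utf-8")
    scoped_result = preflight("public figure", root)
```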

Your data stays local. All working dirs are gitignored. Nothing of yours pushes to the public repo. The committed example_dossier/ is the only dossier-shaped content in the tree.

Install

Requirements:

  • Python 3.11+
  • Claude Code CLI installed and configured
  • Optional: paid-API keys in .env (see .env.example). Every paid path is gated behind an env var; free-tier sweeps run without any.

git clone https://github.com/sc932/content_finder
cd content_finder
cp .env.example .env                  # leave blank any keys you don't have
pip install -r requirements.txt       # only needed if you'll run transcription

Most scripts are stdlib-only. The exceptions:

  • scripts/transcribe needs faster-whisper (in requirements.txt). Pulls a model on first run; can use CPU or GPU.
  • scripts/youtube-search and scripts/acquire-audio shell out to yt-dlp — install via pip install yt-dlp or system package. acquire-audio also requires ffmpeg (e.g., apt install ffmpeg) for audio extraction.

The cataloging-, fingerprint-, and dossier-only flows don't need any of the above; install these dependencies on demand when you reach the transcription or audio-acquisition workflows.

Usage details

The Quick start above covers the happy path. A few longer-running flows worth flagging:

Subsequent sessions. Run claude again from the same directory. It loads the existing subject/fingerprint.yaml, surfaces any pending handoffs / interviews via AskUserQuestion, and proposes the next pass.

Dropping data exports. During the seed interview Claude offers to ingest a resume / CV PDF, LinkedIn GDPR export, Twitter/X archive, Google Takeout, or GitHub data export. When you opt in, Claude gives you a precise drop path (e.g., "drop your resume at inputs/resume.pdf, then say 'ready'"); the file stays gitignored on your local disk.

Exporting the dossier.

./scripts/export --target ../my_dossier

Emits a self-contained folder with the catalog markdown (Obsidian-wikilinked), the structured fingerprint, and a _MAP.md index written for both humans and downstream LLM agents. See example_dossier/ for what the output looks like.

About the worked example

example_dossier/ is a redacted snapshot of the maintainer's own dossier — the output of running scripts/export against the maintainer's full-fidelity local working state, then applying a public-share PII strip. It demonstrates what dozens of recursive passes can produce, and gives the maintainer a real indexable web-footprint.

Stripped before publishing: birthplace, hometown, secondary schooling, formative-teacher names, parents' details, personal email addresses, pseudonymous personal/forum handles. Kept: birth year, public co-authors / academic mentors, employment, papers, talks, awards, public profile handles (GitHub, LinkedIn, Twitter, Facebook). The full unredacted record lives only in the maintainer's local working state.

Internal IDs explained. References like HANDOFF-0001, FRONT-0010, SC-V-0017, INTEL-V-0002 you'll see throughout the dossier are tracking IDs from the live working state — handoff queue items, frontier items, source citations in findings/by-source/*.md. The files they point to are gitignored and don't ship in the public repo. They're audit-trail breadcrumbs showing the depth of the tool's tracking; from a public reader's perspective they're dead pointers. Your own dossier accumulates the same kind of IDs in your own gitignored working state.

Replacing the example with your own. Run scripts/export --target ./example_dossier --force against your working state, apply whatever redaction policy you want, and commit. The committed example is overwritten on each export.

Repo layout

content_finder/
├── README.md           — this file
├── AGENTS.md           — operating manual (read by Claude every session)
├── MAINTAINING.md      — dev notes (build status, backlog, contributing)
├── CLAUDE.md           — pointer to AGENTS.md
├── BACKLOG.md
├── LICENSE             — MIT
├── SCOPE.md.example    — template for non-self subject pre-flight (copy to SCOPE.md and fill in)
├── requirements.txt    — Python deps (faster-whisper for transcription)
├── .env.example        — paid-API key template
├── scripts/            — Python executables (each with --help and stdin/stdout JSON I/O)
├── workflows/          — multi-script pipeline docs (one per pipeline)
├── agents/             — Claude Code subagent definitions for per-source sweeps
└── example_dossier/    — committed redacted worked example

When a pass runs, the tool creates and writes to these gitignored working dirs:

├── subject/            — fingerprint.yaml, the canonical identity
├── frontier/           — priority-queued leads
├── tasks/              — handoffs.md, interviews.md (Claude-managed state)
├── findings/           — raw/ blobs + by-source/ notes
├── dossier/            — work-in-progress catalog
├── coverage/           — leak-detection probes + counters
├── log/                — append-only pass history
└── inputs/             — drop GDPR exports, resumes, hints.md here

Workflows

A pass runs a budgeted subset of these — not all every time:

| Workflow | Trigger | What it does |
| --- | --- | --- |
| Seed | First run, or new files in inputs/ | Interview the human + parse any dropped resume / GDPR / Takeout files into fingerprint.yaml + initial frontier |
| Sweep | Every pass | Per-source subagents: OpenAlex, Semantic Scholar, Wayback CDX, podcast indices, YouTube, GitHub, Sherlock + WhatsMyName |
| Frontier | Every pass | Pop top-scored items, fetch, score discovered links/entities, decay older scores |
| Detective | User drops a fuzzy-recall hint into inputs/hints.md | Targeted hunt: dorking, Wayback, news archives, paywalled-handoff queueing |
| Disambiguate | Hit score against fingerprint is ambiguous | Rule-based scoring; below threshold → filed as interview question |
| Acquire | Item passes threshold | Download via yt-dlp / gallery-dl / waybackpack / requests; SHA-256; store provenance |
| Transcribe | Audio/video acquired | faster-whisper; store segments + full transcript |
| Catalog | Enough new verified artifacts | Update dossier/*.md with citations back to artifact hashes |
| Coverage | Periodic | Negative probes, capture-recapture estimation, confidence reporting |
| Export | User says "export" | Emit self-contained dossier folder via scripts/export |

Each non-trivial workflow has its own doc in workflows/.
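
The Coverage workflow mentions capture-recapture estimation. One standard form of it is the Chapman-corrected Lincoln-Petersen estimator, which treats two independent sweeps as two "captures" and uses their overlap to estimate how many artifacts exist in total. Whether the tool uses this exact estimator is not stated here, so the function below is an assumption for illustration.

```python
def chapman_estimate(sweep_a, sweep_b):
    """Estimate total population size from two independent samples."""
    a, b = set(sweep_a), set(sweep_b)
    n1, n2, m = len(a), len(b), len(a & b)
    # Chapman's correction stays defined even when the overlap m is zero.
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical artifact IDs found by two different sources.
scholarly = {"p1", "p2", "p3", "p4"}
web = {"p3", "p4", "p5"}

estimated_total = chapman_estimate(scholarly, web)
found = len(scholarly | web)
estimated_recall = found / estimated_total
```

A small overlap relative to the sweep sizes pushes the estimate up, which signals that many artifacts have probably not been surfaced yet. The estimate is only as good as the independence assumption between the two sweeps.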

Confidence + coverage

The tool distinguishes precision (each entry is correct) from recall (the list is exhaustive). Every list-field in fingerprint.yaml carries a _coverage.<field>.status tag — comprehensive-as-of:<date> (confirmed exhaustive on that date) or known-partial (more entries may exist that haven't been surfaced).

Don't infer completeness from a known-partial list. The absence of evidence ≠ evidence of absence; the tool surfaces this distinction explicitly so downstream consumers (you, future Claude sessions, agents reading the export) don't accidentally treat partial inventories as complete.
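
A downstream consumer could guard against that mistake with a small helper like the one sketched here. It assumes fingerprint.yaml has already been parsed into a dict, that coverage tags sit under a nested `_coverage` key, and that the date is ISO-formatted; none of these layout details are confirmed by this README.

```python
from datetime import date


def coverage_of(fingerprint: dict, field: str):
    """Return (is_comprehensive, as_of_date) for a list-field.

    A missing tag is treated as known-partial, the conservative default.
    """
    status = (
        fingerprint.get("_coverage", {}).get(field, {}).get("status", "known-partial")
    )
    prefix = "comprehensive-as-of:"
    if status.startswith(prefix):
        return True, date.fromisoformat(status[len(prefix):])
    return False, None


# Hypothetical parsed fingerprint.
fingerprint = {
    "employers": ["Example Corp"],
    "podcasts": ["Example Show ep. 12"],
    "_coverage": {
        "employers": {"status": "comprehensive-as-of:2025-01-15"},
        "podcasts": {"status": "known-partial"},
    },
}

employers_cov = coverage_of(fingerprint, "employers")
podcasts_cov = coverage_of(fingerprint, "podcasts")
untagged_cov = coverage_of(fingerprint, "talks")
```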

Negative probes in coverage/probes.md seed known-existing artifacts; if a sweep doesn't re-surface a probe, the pipeline is leaking and that's investigated before trusting more findings.
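
The probe check itself reduces to a set difference. In this sketch the probe list and sweep results are plain lists of URLs; the actual format of coverage/probes.md is not described here, so the parsing step is left out and `leaked_probes` is a hypothetical name.

```python
def leaked_probes(probe_urls, surfaced_urls):
    """Return known-existing artifacts that the sweeps failed to re-surface."""
    return sorted(set(probe_urls) - set(surfaced_urls))


probes = ["https://example.org/talk-2019", "https://example.org/paper-2014"]
surfaced = ["https://example.org/paper-2014", "https://example.org/blog-post"]

missing = leaked_probes(probes, surfaced)
pipeline_trusted = not missing
```

An empty result means every seeded probe was found again; anything else names the artifacts to investigate first.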

Legal + ethical

  • Self-subject default. No restrictions beyond site ToS.
  • Other subjects require SCOPE.md. Tool refuses to run without one — documenting consent or legitimate-interest justification or public-figure framing. Copy SCOPE.md.example to SCOPE.md and fill it in before starting.
  • Never automates logins to auth-walled sites. Cookies and sessions are not shared with the tool — those go through the handoff queue (e.g., LinkedIn post feed, paywalled articles) so your accounts are never at risk of platform bans.
  • No doxing of third parties. Home addresses, phone numbers, family info are never surfaced for non-consenting individuals — even when findable.
  • Even self-research leaks intent patterns. Consider a separate browser profile / VPN if you care about that.

Status

Pre-v0.1. The tool is actively maintained, with the maintainer's own dossier as the development driver: workflows are documented, scripts are runnable, the export feature works, and the cataloging is producing real output. The seed / cleanup / contributor experience for other subjects is still unpolished; expect rough edges if you're the second person to run this.

See MAINTAINING.md for the staged rollout plan (v0.1 → v2) and current backlog.

About

web crawler for finding your online footprint with claude
