v1.0.0 — Winter 2024 → Winter 2026
First public release. Reproducible analysis pipeline covering every YC company from Winter 2024 through Winter 2026.
Scope
- 1,014 companies across 6 cohorts (W24, W25, Sp25, Su25, F25, W26)
- 924 AI-focused companies classified on 5 axes: end market × product layer × AI pattern × buyer × wedge archetype
- ~25 binary signals per company (mentions_agents, regulated_industry, services_angle, etc.)
- Founder-background signals extracted from bios (ex-big-tech, research, operator, domain expert, repeat founder, enterprise, regulated)
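
The binary signals above can be pictured as keyword matches that keep their evidence, so each flag traces back to the text that triggered it. The following is only a minimal sketch under stated assumptions: the keyword lists and the `detect_signals` helper are hypothetical and are not taken from the repository's `config/` files or `src/` code.

```python
import re

# Hypothetical keyword lists for illustration only; the real ones are
# described as living in config/ as YAML.
SIGNAL_KEYWORDS = {
    "mentions_agents": ["agent", "agents", "agentic"],
    "regulated_industry": ["hipaa", "compliance", "fda"],
}


def detect_signals(description: str) -> dict:
    """Return {signal: {"fired": bool, "hits": [keywords]}} for one company."""
    text = description.lower()
    result = {}
    for signal, keywords in SIGNAL_KEYWORDS.items():
        # Whole-word match, so "agent" does not fire on "reagent".
        hits = [
            kw for kw in keywords
            if re.search(rf"\b{re.escape(kw)}\b", text)
        ]
        result[signal] = {"fired": bool(hits), "hits": hits}
    return result


example = detect_signals("AI agents that automate HIPAA compliance paperwork.")
```

Storing the hits next to the boolean is what makes a classification auditable: a reader can see which words produced the flag rather than trusting an opaque label.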
Headline findings
- AI is the default at YC. 86% → 93% cohort share.
- Vertical positioning rose 59% → 71%; horizontal fell 15% → 9%.
- The RAG era ended quietly. rag -7.8pp, retrieval -6.3pp, fine-tuning -2.9pp, copilot -5.4pp.
- Agents are real but partly rhetorical. Term agent +16.2pp vs classifier Autonomous Agent pattern +11.8pp; ~4pp is re-labeled copilots.
- AI-native service firms are emerging as a credible wedge. Replaces-outsourced-labor archetype +3.5pp.
- Compliance and audit are quietly central. audit +7.9pp, EU AI Act / regulation +6.2pp.
- Robotics / Embodied AI +9.4pp, larger than the discourse suggests.
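
The figures above are percentage-point (pp) changes in share, which differ from relative percent changes. A short sketch of the arithmetic, using only numbers quoted in the findings; the helper names `pp_delta` and `relative_change` are illustrative, and treating each delta as a simple difference between two cohort shares is an assumption about how the pipeline computes them.

```python
def pp_delta(start_share: float, end_share: float) -> float:
    """Percentage-point change between two shares expressed in percent."""
    return round(end_share - start_share, 1)


def relative_change(start_share: float, end_share: float) -> float:
    """Relative change in percent, shown for contrast with pp_delta."""
    return round(100 * (end_share - start_share) / start_share, 1)


# AI cohort share: 86% -> 93%
ai_share_pp = pp_delta(86.0, 93.0)
ai_share_rel = relative_change(86.0, 93.0)

# Gap between the rise in the term "agent" (+16.2pp) and the rise in the
# classifier's Autonomous Agent pattern (+11.8pp), i.e. the "~4pp" above.
agent_gap = round(16.2 - 11.8, 1)
```

The gap is the basis for the "partly rhetorical" reading: the term grew faster than the pattern the classifier actually detects.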
What's in the release
- src/: scraper, classifier, founder analysis, trend analysis, visualisation, report writer
- config/: taxonomy, keywords, cohorts (all YAML; tune by editing)
- data/processed/raw_companies.json: the full merged scrape (9 MB); ships with the release for one-command reproduction
- outputs/charts/: 16 charts
- outputs/tables/: 25 CSVs (cohort metrics, axis shares, deltas, emerging, term freqs, founder signals, wedge analysis)
- outputs/analysis_summary.md: full narrative
- outputs/key_findings.md: one-page bullet summary
- outputs/MEDIUM_ARTICLE.md: long-form article
Reproduce
git clone https://github.com/sjmoran/yc-ai-cohort-analysis.git
cd yc-ai-cohort-analysis
pip install -r requirements.txt
python main.py --skip-scrape # ~20s using shipped dataset
# or
python main.py --no-cache # ~30s, full fresh scrape

Every classification carries the keyword hits that fired it. Every number in the narrative is traceable to a CSV. The whole thing is deterministic and inspectable.