Fast & Easy crawling framework deeply integrated with TypeORM. Built on top of Crawlee.
CrawORM lets you describe what to scrape and where to store it on the same class, using TypeScript decorators you already know from TypeORM.
@Entity()
@Crawlable({ url: 'https://news.ycombinator.com/', listSelector: 'tr.athing' })
class Story {
@PrimaryColumn() @Selector({ selector: '', pick: 'attr', attr: 'id' }) id!: string;
@Column() @Selector('span.titleline > a') title!: string;
@Column() @Href('span.titleline > a') url!: string;
}
const orm = new CrawORM({ dataSource });
await orm.init();
await orm.crawler(Story).run();That's it. URLs deduped, requests retried, sessions rotated, results upserted into your database.
Existing options force you to maintain three parallel descriptions of the same data:
- A TypeORM entity for the database schema
- A scrape function that picks fields out of HTML
- A DTO / interface so TypeScript knows the shape
CrawORM collapses all three into one decorated class. The ORM, the extractor, and the type are the same object.
| Goal | How |
|---|---|
| Fast | Cheerio extraction in-process; bulk transactional inserts; lazy Crawlee imports |
| Easy | One class, one decorator-stack, sane defaults — await orm.crawler(X).run() works |
| Safe at scale | CrawlState table = idempotent re-runs, crash recovery, conditional GETs |
| Escape hatches | crawlRepo.orm is the raw TypeORM repository; crawleeOptions forwards to Crawlee |
npm install craw-orm typeorm crawlee reflect-metadata
# Optional engines:
npm install playwright # for JS-rendered sites
npm install pg sqlite3 mysql2 # whichever DB driver you needtsconfig.json must enable decorators:
{
"compilerOptions": {
"experimentalDecorators": true,
"emitDecoratorMetadata": true,
"target": "ES2022"
}
}| Decorator | Purpose |
|---|---|
@Crawlable(meta) |
Marks a class as a crawl target, declares URL/engine/pagination/conflict strategy |
@Selector(css) |
Extract trimmed text from a CSS selector |
@XPath(xpath) |
XPath equivalent (Playwright engines only) |
@Attr(name, css) |
Extract an attribute |
@Href(css) / @Src(css) |
Extract href/src (auto-resolved to absolute URLs) |
@Html(css) |
Extract inner HTML |
@Regex(pattern) |
Apply a regex on top of the previous selector |
@Constant(value) |
Pin a static value to a field |
@Nested(EntityCtor, css) |
Extract child entities scoped to a container |
CrawORM auto-creates a craw_orm_state table that tracks every URL ever crawled:
- Idempotency — re-running a crawl with
staleAfterMsskips fresh URLs - Recovery —
resume: truepicks up URLs left inpendingafter a crash - Auditability — query failed URLs, success rates, last-error messages
Add it to your DataSource entities:
import { CrawlState } from 'craw-orm';
const dataSource = new DataSource({ entities: [..., CrawlState] });Set on @Crawlable({ onConflict: ... }):
skip— first wins, subsequent crawls don't change the rowoverwrite— replace all fieldsmerge— update only non-null fields from the new crawlupsert— DB-level UPSERT on the primary key (default)version— relies on@VersionColumn, increments on save
| Engine | When |
|---|---|
cheerio (default) |
Static HTML, fastest |
playwright |
JS-rendered sites, anti-bot protection, XPath support |
puppeteer |
Same as Playwright, alternative driver |
http |
Raw response, no parsing — for APIs |
Set on @Crawlable({ engine: 'playwright' }) or override at runtime: orm.crawler(X, { engine: 'playwright' }).
@Crawlable({
hooks: {
beforeRequest: async ({ url }) => { /* dismiss banners, log in */ },
afterParse: async (entity) => { /* normalise fields */ },
beforeSave: (entity) => entity.price > 0, // return false to skip
afterSave: async (entity) => { /* notify, index, etc. */ },
onError: async (err) => { /* alerting */ },
}
})When you need raw access:
// Raw TypeORM repository — for queries, query builder, transactions:
orm.repository(Product).orm.createQueryBuilder('p')...
// Raw Crawlee config — proxy pools, session rotation, fingerprints:
orm.crawler(Product, {
crawleeOptions: {
sessionPoolOptions: { maxPoolSize: 200 },
browserPoolOptions: { /* ... */ },
}
});@Selectorextraction is in-memory Cheerio — ~50µs per field on typical pages- Persistence batches in chunks (default 100) inside transactions — ~2-3ms per row on Postgres
- For >10k pages/day, prefer Postgres over SQLite (write contention)
- For >1M pages/day, consider sharding by
tagand running parallel crawler instances
Experimental. API may change. Pin to a specific version in production.
MIT.