CrawORM

Fast & Easy crawling framework deeply integrated with TypeORM. Built on top of Crawlee.

CrawORM lets you describe what to scrape and where to store it on the same class, using TypeScript decorators you already know from TypeORM.

@Entity()
@Crawlable({ url: 'https://news.ycombinator.com/', listSelector: 'tr.athing' })
class Story {
  @PrimaryColumn() @Selector({ selector: '', pick: 'attr', attr: 'id' }) id!: string;
  @Column()       @Selector('span.titleline > a')                       title!: string;
  @Column()       @Href('span.titleline > a')                           url!: string;
}

const orm = new CrawORM({ dataSource });
await orm.init();
await orm.crawler(Story).run();

That's it. URLs deduped, requests retried, sessions rotated, results upserted into your database.

Why CrawORM?

Existing options force you to maintain three parallel descriptions of the same data:

A TypeORM entity for the database schema
A scrape function that picks fields out of HTML
A DTO / interface so TypeScript knows the shape

CrawORM collapses all three into one decorated class. The ORM, the extractor, and the type are the same object.

Design Goals

Goal	How
Fast	Cheerio extraction in-process; bulk transactional inserts; lazy Crawlee imports
Easy	One class, one decorator-stack, sane defaults — `await orm.crawler(X).run()` works
Safe at scale	CrawlState table = idempotent re-runs, crash recovery, conditional GETs
Escape hatches	`crawlRepo.orm` is the raw TypeORM repository; `crawleeOptions` forwards to Crawlee

Install

npm install craw-orm typeorm crawlee reflect-metadata
# Optional engines:
npm install playwright          # for JS-rendered sites
npm install pg sqlite3 mysql2   # whichever DB driver you need

tsconfig.json must enable decorators:

{
  "compilerOptions": {
    "experimentalDecorators": true,
    "emitDecoratorMetadata": true,
    "target": "ES2022"
  }
}

The Decorators

Decorator	Purpose
`@Crawlable(meta)`	Marks a class as a crawl target, declares URL/engine/pagination/conflict strategy
`@Selector(css)`	Extract trimmed text from a CSS selector
`@XPath(xpath)`	XPath equivalent (Playwright engines only)
`@Attr(name, css)`	Extract an attribute
`@Href(css)` / `@Src(css)`	Extract `href`/`src` (auto-resolved to absolute URLs)
`@Html(css)`	Extract inner HTML
`@Regex(pattern)`	Apply a regex on top of the previous selector
`@Constant(value)`	Pin a static value to a field
`@Nested(EntityCtor, css)`	Extract child entities scoped to a container

The CrawlState Table

CrawORM auto-creates a craw_orm_state table that tracks every URL ever crawled:

Idempotency — re-running a crawl with staleAfterMs skips fresh URLs
Recovery — resume: true picks up URLs left in pending after a crash
Auditability — query failed URLs, success rates, last-error messages

Add it to your DataSource entities:

import { CrawlState } from 'craw-orm';
const dataSource = new DataSource({ entities: [..., CrawlState] });

Conflict Strategies

Set on @Crawlable({ onConflict: ... }):

skip — first wins, subsequent crawls don't change the row
overwrite — replace all fields
merge — update only non-null fields from the new crawl
upsert — DB-level UPSERT on the primary key (default)
version — relies on @VersionColumn, increments on save

Engine Selection

Engine	When
`cheerio` (default)	Static HTML, fastest
`playwright`	JS-rendered sites, anti-bot protection, XPath support
`puppeteer`	Same as Playwright, alternative driver
`http`	Raw response, no parsing — for APIs

Set on @Crawlable({ engine: 'playwright' }) or override at runtime: orm.crawler(X, { engine: 'playwright' }).

Hooks

@Crawlable({
  hooks: {
    beforeRequest: async ({ url }) => { /* dismiss banners, log in */ },
    afterParse:    async (entity)  => { /* normalise fields */ },
    beforeSave:    (entity)        => entity.price > 0,  // return false to skip
    afterSave:     async (entity)  => { /* notify, index, etc. */ },
    onError:       async (err)     => { /* alerting */ },
  }
})

Escape Hatches

When you need raw access:

// Raw TypeORM repository — for queries, query builder, transactions:
orm.repository(Product).orm.createQueryBuilder('p')...

// Raw Crawlee config — proxy pools, session rotation, fingerprints:
orm.crawler(Product, {
  crawleeOptions: {
    sessionPoolOptions: { maxPoolSize: 200 },
    browserPoolOptions: { /* ... */ },
  }
});

Performance Notes

@Selector extraction is in-memory Cheerio — ~50µs per field on typical pages
Persistence batches in chunks (default 100) inside transactions — ~2-3ms per row on Postgres
For >10k pages/day, prefer Postgres over SQLite (write contention)
For >1M pages/day, consider sharding by tag and running parallel crawler instances

Status

Experimental. API may change. Pin to a specific version in production.

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
examples		examples
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
eslint.config.js		eslint.config.js
package.json		package.json
tsconfig.examples.json		tsconfig.examples.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CrawORM

Why CrawORM?

Design Goals

Install

The Decorators

The CrawlState Table

Conflict Strategies

Engine Selection

Hooks

Escape Hatches

Performance Notes

Status

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CrawORM

Why CrawORM?

Design Goals

Install

The Decorators

The CrawlState Table

Conflict Strategies

Engine Selection

Hooks

Escape Hatches

Performance Notes

Status

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages