Skip to content

ytrsoft/craw-orm

Repository files navigation

CrawORM

Fast & Easy crawling framework deeply integrated with TypeORM. Built on top of Crawlee.

CrawORM lets you describe what to scrape and where to store it on the same class, using TypeScript decorators you already know from TypeORM.

@Entity()
@Crawlable({ url: 'https://news.ycombinator.com/', listSelector: 'tr.athing' })
class Story {
  @PrimaryColumn() @Selector({ selector: '', pick: 'attr', attr: 'id' }) id!: string;
  @Column()       @Selector('span.titleline > a')                       title!: string;
  @Column()       @Href('span.titleline > a')                           url!: string;
}

const orm = new CrawORM({ dataSource });
await orm.init();
await orm.crawler(Story).run();

That's it. URLs deduped, requests retried, sessions rotated, results upserted into your database.


Why CrawORM?

Existing options force you to maintain three parallel descriptions of the same data:

  1. A TypeORM entity for the database schema
  2. A scrape function that picks fields out of HTML
  3. A DTO / interface so TypeScript knows the shape

CrawORM collapses all three into one decorated class. The ORM, the extractor, and the type are the same object.

Design Goals

Goal How
Fast Cheerio extraction in-process; bulk transactional inserts; lazy Crawlee imports
Easy One class, one decorator-stack, sane defaults — await orm.crawler(X).run() works
Safe at scale CrawlState table = idempotent re-runs, crash recovery, conditional GETs
Escape hatches crawlRepo.orm is the raw TypeORM repository; crawleeOptions forwards to Crawlee

Install

npm install craw-orm typeorm crawlee reflect-metadata
# Optional engines:
npm install playwright          # for JS-rendered sites
npm install pg sqlite3 mysql2   # whichever DB driver you need

tsconfig.json must enable decorators:

{
  "compilerOptions": {
    "experimentalDecorators": true,
    "emitDecoratorMetadata": true,
    "target": "ES2022"
  }
}

The Decorators

Decorator Purpose
@Crawlable(meta) Marks a class as a crawl target, declares URL/engine/pagination/conflict strategy
@Selector(css) Extract trimmed text from a CSS selector
@XPath(xpath) XPath equivalent (Playwright engines only)
@Attr(name, css) Extract an attribute
@Href(css) / @Src(css) Extract href/src (auto-resolved to absolute URLs)
@Html(css) Extract inner HTML
@Regex(pattern) Apply a regex on top of the previous selector
@Constant(value) Pin a static value to a field
@Nested(EntityCtor, css) Extract child entities scoped to a container

The CrawlState Table

CrawORM auto-creates a craw_orm_state table that tracks every URL ever crawled:

  • Idempotency — re-running a crawl with staleAfterMs skips fresh URLs
  • Recoveryresume: true picks up URLs left in pending after a crash
  • Auditability — query failed URLs, success rates, last-error messages

Add it to your DataSource entities:

import { CrawlState } from 'craw-orm';
const dataSource = new DataSource({ entities: [..., CrawlState] });

Conflict Strategies

Set on @Crawlable({ onConflict: ... }):

  • skip — first wins, subsequent crawls don't change the row
  • overwrite — replace all fields
  • merge — update only non-null fields from the new crawl
  • upsert — DB-level UPSERT on the primary key (default)
  • version — relies on @VersionColumn, increments on save

Engine Selection

Engine When
cheerio (default) Static HTML, fastest
playwright JS-rendered sites, anti-bot protection, XPath support
puppeteer Same as Playwright, alternative driver
http Raw response, no parsing — for APIs

Set on @Crawlable({ engine: 'playwright' }) or override at runtime: orm.crawler(X, { engine: 'playwright' }).

Hooks

@Crawlable({
  hooks: {
    beforeRequest: async ({ url }) => { /* dismiss banners, log in */ },
    afterParse:    async (entity)  => { /* normalise fields */ },
    beforeSave:    (entity)        => entity.price > 0,  // return false to skip
    afterSave:     async (entity)  => { /* notify, index, etc. */ },
    onError:       async (err)     => { /* alerting */ },
  }
})

Escape Hatches

When you need raw access:

// Raw TypeORM repository — for queries, query builder, transactions:
orm.repository(Product).orm.createQueryBuilder('p')...

// Raw Crawlee config — proxy pools, session rotation, fingerprints:
orm.crawler(Product, {
  crawleeOptions: {
    sessionPoolOptions: { maxPoolSize: 200 },
    browserPoolOptions: { /* ... */ },
  }
});

Performance Notes

  • @Selector extraction is in-memory Cheerio — ~50µs per field on typical pages
  • Persistence batches in chunks (default 100) inside transactions — ~2-3ms per row on Postgres
  • For >10k pages/day, prefer Postgres over SQLite (write contention)
  • For >1M pages/day, consider sharding by tag and running parallel crawler instances

Status

Experimental. API may change. Pin to a specific version in production.

License

MIT.

About

Fast & Easy crawling framework deeply integrated with TypeORM. Built on Crawlee

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors