node-web-fetch

😎 @web-master/node-web-fetch 😎

Fetch web data as easily as possible

Package License (MIT)

Description

It combines @web-master/node-web-crawler and @web-master/node-web-scraper into a single package.

It can:

  • FETCH
    • SCRAPE
      • It scrapes the specified page
      • It gathers data from the page according to the ScrapeConfig
    • CRAWL
      • It scrapes the specified page and gathers its links
      • It follows those links and scrapes each linked page
      • It gathers data from each page according to the CrawlConfig
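
The usage examples in this README pass `target` in three shapes: a single URL string, an array of URLs, and an object with `url` and `iterator`. As a rough illustration of how those shapes map onto scraping and crawling, here is a minimal sketch. The helper `describeTarget` and its return labels are hypothetical and are not part of the package; the package's own dispatch logic may differ.

```javascript
// Hypothetical helper (not part of @web-master/node-web-fetch):
// classify a `target` value by the three shapes used in this README.
function describeTarget(target) {
  if (typeof target === 'string') {
    return 'scrape'; // one URL: single page scraping
  }
  if (Array.isArray(target)) {
    return 'crawl-known-urls'; // list of URLs known in advance
  }
  if (target && typeof target.url === 'string' && target.iterator) {
    return 'crawl-dynamic'; // start URL plus an iterator that finds links
  }
  throw new TypeError('Unrecognised target shape');
}
```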

Installation

$ npm install --save @web-master/node-web-fetch

Usage

Single Page Scraping

Basic

import fetch from '@web-master/node-web-fetch';

const data = await fetch({
  target: 'http://example.com',
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      attr: 'href',
    },
  },
});

console.log(data);
// {
//   title: 'Example Domain',
//   info: 'http://www.iana.org/domains/example'
// }
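
The `fetch` map above mixes two field forms: a bare selector string (`title`) and an object with `selector` and `attr` (`info`). One way to think about them is as the same structure with an optional attribute. The sketch below shows that idea only; `normalizeField` is a hypothetical name, and treating a missing `attr` as "use the element's text" is an assumption drawn from the example output, not a documented rule.

```javascript
// Hypothetical helper (not part of the package): bring both field forms
// into one { selector, attr } shape. `attr: null` is assumed to mean
// "read the element's text content".
function normalizeField(spec) {
  if (typeof spec === 'string') {
    return { selector: spec, attr: null };
  }
  return { selector: spec.selector, attr: spec.attr ?? null };
}
```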

Waitable (using Puppeteer)

import fetch from '@web-master/node-web-fetch';

const data = await fetch({
  target: 'http://example.com',
  waitFor: 3 * 1000, // wait 3 seconds for the content to load (e.g. single-page apps)
  fetch: {
    title: 'h1',
    info: {
      selector: 'p > a',
      attr: 'href',
    },
  },
});

console.log(data);
// {
//   title: 'Example Domain',
//   info: 'http://www.iana.org/domains/example'
// }

Multi-Page Crawling

You Already Know the Target URLs

import fetch from '@web-master/node-web-fetch';

const pages = await fetch({
  target: [
    'https://example1.com',
    'https://example2.com',
    'https://example3.com',
  ],
  fetch: () => ({
    title: 'h1',
  }),
});

console.log(pages);
// [
//   { title: 'An easiest crawling and scraping module for NestJS' },
//   { title: 'A minimalistic boilerplate on top of Webpack, Babel, TypeScript and React' },
//   { title: '[Experimental] React SSR as a view template engine' }
// ]

You Don't Know the Target URLs and Want to Crawl Dynamically

import fetch from '@web-master/node-web-fetch';

const pages = await fetch({
  target: {
    url: 'https://news.ycombinator.com',
    iterator: {
      selector: 'span.age > a',
      convert: (x) => `https://news.ycombinator.com/${x}`,
    },
  },
  fetch: () => ({
    title: '.title > a',
  }),
});

console.log(pages);
// [
//   { title: 'An easiest crawling and scraping module for NestJS' },
//   { title: 'A minimalistic boilerplate on top of Webpack, Babel, TypeScript and React' },
//   ...
//   ...
//   { title: '[Experimental] React SSR as a view template engine' }
// ]
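
The `convert` callback above builds each URL with a template string, which works when the captured value is a relative path without a leading slash. Assuming `x` is the raw `href` of each matched link (the README does not state this explicitly), a slightly more defensive sketch resolves it with Node's built-in `URL` class. The names `BASE` and `toAbsolute` are illustrative only.

```javascript
// Assumption: `x` is the href captured by the iterator selector,
// e.g. 'item?id=1', '/item?id=1', or an already absolute URL.
const BASE = 'https://news.ycombinator.com';

function toAbsolute(x) {
  // The WHATWG URL constructor resolves relative references against BASE
  // and leaves absolute URLs unchanged.
  return new URL(x, BASE).href;
}
```

It could then be passed as `convert: toAbsolute` in the `iterator` config. Unlike plain concatenation, this avoids a doubled slash when the href already starts with `/`.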

Waitable (using Puppeteer)

import fetch from '@web-master/node-web-fetch';

const pages = await fetch({
  target: {
    url: 'https://news.ycombinator.com',
    iterator: {
      selector: 'span.age > a',
      convert: (x) => `https://news.ycombinator.com/${x}`,
    },
  },
  waitFor: 3 * 1000, // wait 3 seconds for the content to load (e.g. single-page apps)
  fetch: () => ({
    title: '.title > a',
  }),
});

console.log(pages);
// [
//   { title: 'An easiest crawling and scraping module for NestJS' },
//   { title: 'A minimalistic boilerplate on top of Webpack, Babel, TypeScript and React' },
//   ...
//   ...
//   { title: '[Experimental] React SSR as a view template engine' }
// ]

TypeScript Support

import fetch from '@web-master/node-web-fetch';

interface HackerNewsPage {
  title: string;
}

const pages: HackerNewsPage[] = await fetch({
  target: {
    url: 'https://news.ycombinator.com',
    iterator: {
      selector: 'span.age > a',
      convert: (x) => `https://news.ycombinator.com/${x}`,
    },
  },
  fetch: () => ({
    title: '.title > a',
  }),
});

console.log(pages);
// [
//   { title: 'An easiest crawling and scraping module for NestJS' },
//   { title: 'A minimalistic boilerplate on top of Webpack, Babel, TypeScript and React' },
//   ...
//   ...
//   { title: '[Experimental] React SSR as a view template engine' }
// ]
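
TypeScript types are erased at runtime, so an annotation such as `HackerNewsPage[]` does not by itself guarantee that every scraped object has a string `title`. If that matters, a small runtime check can filter the results. The sketch below is plain JavaScript; `isHackerNewsPage` and `keepValidPages` are hypothetical helpers, and what the package returns for an unmatched selector is not specified here.

```javascript
// Hypothetical runtime guard mirroring the HackerNewsPage interface above.
function isHackerNewsPage(value) {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof value.title === 'string'
  );
}

// Keep only the entries that actually carry a string title.
function keepValidPages(pages) {
  return pages.filter(isHackerNewsPage);
}
```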

Related