Skip to content

Ensures polite outgoing HTTP requests that respects robots.txt and aren't made too close to each other

License

Notifications You must be signed in to change notification settings

voxpelli/node-fetch-politely

Repository files navigation

Fetch Politely

js-semistandard-style

A library for ensuring polite outgoing HTTP requests that respect robots.txt and aren't made too close to each other

Installation

npm install fetch-politely --save

Usage

Simple:

const fetchInstance = new FetchPolitely((err, url, message) => {
  if (err) { return; }

  // The URL has been cleared for fetching – the hostname isn't throttled and robots.txt doesn't ban it
}, {
  // Robots.txt checking requires specification of a User Agent as Robots.txt can contain User Agent specific rules
  // See http://en.wikipedia.org/wiki/User_agent for more info in format
  userAgent: 'Your-Application-Name/your.app.version (http://example.com/optional/full/app/url)',
});

// When a slot has been reserved the callback sent in the constructor will be called
fetchInstance.requestSlot('http://foo.example.org/interesting/content/');

FetchPolitely()

var fetchInstance = new FetchPolitely(callback, [options]);

Parameters

  • callback(err, url, message, [content]) => {};, called for each succesful request slot

Options

  • throttleDuration – for how long in milliseconds to throttle requests to each hostname. Defaults to 10 seconds.
  • returnContent – whether to fetch and return the content with the callback when a URL has received a request slot. Defaults to false.
  • logger – a Bunyan compatible logger library. Defaults to bunyan-duckling which uses console.log()/.error().
  • lookup – an object or class that keeps track of throttled hosts and queued URL:s. Defaults to PoliteLookup.
  • lookupOptions – an object that defines extra lookup options.
  • allowed – a function that checks whether a URL is allowed to be fetched. Defaults to PoliteRobot.allowed().
  • robotCache – a cache method used by PoliteRobot to cache fetched robots.txt. Defaults to wrapped lru-cache.
  • robotCacheLimit – a limit of the number of items to keep in the default lru-cache of PoliteRobot.
  • robotPool – an HTTP agent to use for the request-library of PoliteRobot.
  • userAgentrequired by PoliteRobot and options.returnContent. The User Agent to use for HTTP requests.

Methods

  • requestSlot – tries to reserve a request slot for a URL. Returns a Promise that will be resolved or rejected when the request has been made.

Static

  • FetchPolitely.PoliteError – a very polite error object used for eg. informing about denied URL:s
  • FetchPolitely.PoliteLookup – defines the interface for keeping track of throttled hosts and queued URL:s
  • FetchPolitely.PolitePGLookup – alternative lookup that uses PostgreSQL as the backend
  • FetchPolitely.PoliteRobot – checks whether URL:s are allowed to be fetched according to Robots.txt.

fetchInstance.requestSlot()

fetchInstance.requestSlot(url, [message], [options]);

Parameters

  • url – the URL to reserve a request slot for
  • message – a JSON-encodeable optional message containing eg. instructions for the FetchPolitely callback.

Options

  • allow – if set to true the URL will always be allowd and not be sent to the allowed function.
  • allowDuplicates – if set to false no more than one item of every url + message combination will be queued.

PoliteLookup

The simplest of simple implementations for keeping track of throttled hosts and queued URL:s. Handles it all in-memory. Same interface can be used to build a database backend for this though.

PolitePGLookup

A PostgreSQL + Knex-driven lookup that throttles hosts and queues URL using database tables.

Use by setting up the tables in pglookup.sql and include by setting the FetchPolitely options to:

{
  lookup: FetchPolitely.PolitePGLookup,
  lookupOptions: {
    knex: knexInstance
  }
}

Pull Requests are welcome if someone wants to pull out the Knex-dependency. Most projects where this has been used with Postgres has been using Knex so it got used here as well.

lookupOptions

  • knexrequired – the database connection to use, provided through a Knex object.
  • purgeWindow – the minimum interval in milliseconds between two host purges. Defaults to 500 ms.
  • concurrentReleases – how many parallell database lookups to perform to check for released URL:s. Defaults to 2.
  • releasesPerBatch – how many URL:s to fetch in each database lookup. Defaults to 5.
  • onlyDeduplicateMessages – bool that if set will only deduplicate URL:s with the same message when deduplicating. Defaults to false.

Lint / Test

npm test

About

Ensures polite outgoing HTTP requests that respects robots.txt and aren't made too close to each other

Resources

License

Security policy

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published