A library for ensuring polite outgoing HTTP requests that respect robots.txt and aren't made too close to each other
npm install fetch-politely --save
Simple:
const fetchInstance = new FetchPolitely((err, url, message) => {
  if (err) { return; }
  // The URL has been cleared for fetching – the hostname isn't throttled and robots.txt doesn't ban it
}, {
  // Robots.txt checking requires specification of a User Agent as Robots.txt can contain User Agent specific rules
  // See http://en.wikipedia.org/wiki/User_agent for more info on the format
  userAgent: 'Your-Application-Name/your.app.version (http://example.com/optional/full/app/url)',
});

// When a slot has been reserved the callback sent in the constructor will be called
fetchInstance.requestSlot('http://foo.example.org/interesting/content/');
var fetchInstance = new FetchPolitely(callback, [options]);
- callback – (err, url, message, [content]) => {}; called for each successful request slot
- throttleDuration – for how long in milliseconds to throttle requests to each hostname. Defaults to 10 seconds.
- returnContent – whether to fetch and return the content with the callback when a URL has received a request slot. Defaults to false.
- logger – a Bunyan compatible logger library. Defaults to bunyan-duckling which uses console.log()/.error().
- lookup – an object or class that keeps track of throttled hosts and queued URL:s. Defaults to PoliteLookup.
- lookupOptions – an object that defines extra lookup options.
- allowed – a function that checks whether a URL is allowed to be fetched. Defaults to PoliteRobot.allowed().
- robotCache – a cache method used by PoliteRobot to cache fetched robots.txt. Defaults to wrapped lru-cache.
- robotCacheLimit – a limit of the number of items to keep in the default lru-cache of PoliteRobot.
- robotPool – an HTTP agent to use for the request-library of PoliteRobot.
- userAgent – required by PoliteRobot and options.returnContent. The User Agent to use for HTTP requests.
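To make the effect of throttleDuration concrete, here is a minimal sketch of per-hostname throttling. It is not the library's implementation: createHostThrottle and tryReserve are hypothetical names invented for illustration, and the injectable clock exists only so the example is deterministic.

```javascript
// Hypothetical helper, not part of fetch-politely:
// remembers when each hostname may be requested again.
function createHostThrottle(throttleDuration, now = Date.now) {
  const nextAllowed = new Map(); // hostname -> earliest time (ms) of the next request

  return function tryReserve(url) {
    const { hostname } = new URL(url);
    const current = now();
    if ((nextAllowed.get(hostname) || 0) > current) {
      return false; // this hostname is still throttled
    }
    nextAllowed.set(hostname, current + throttleDuration);
    return true;
  };
}

// Usage with a fake clock and a 10 second throttle
let fakeTime = 0;
const tryReserve = createHostThrottle(10000, () => fakeTime);

const first = tryReserve('http://foo.example.org/a');  // host is free
const second = tryReserve('http://foo.example.org/b'); // same host, too soon
const other = tryReserve('http://bar.example.org/');   // different host
fakeTime = 10000;                                      // 10 seconds later
const later = tryReserve('http://foo.example.org/b');  // throttle has expired
```

The throttle is keyed on hostname rather than on the full URL, so two different paths on the same host share one slot while another host is unaffected.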
- requestSlot – tries to reserve a request slot for a URL. Returns a Promise that will be resolved or rejected when the request has been made.
- FetchPolitely.PoliteError – a very polite error object used for e.g. informing about denied URL:s
- FetchPolitely.PoliteLookup – defines the interface for keeping track of throttled hosts and queued URL:s
- FetchPolitely.PolitePGLookup – alternative lookup that uses PostgreSQL as the backend
- FetchPolitely.PoliteRobot – checks whether URL:s are allowed to be fetched according to Robots.txt.
fetchInstance.requestSlot(url, [message], [options]);
- url – the URL to reserve a request slot for
- message – an optional JSON-encodable message containing e.g. instructions for the FetchPolitely callback.
- allow – if set to true the URL will always be allowed and not be sent to the allowed function.
- allowDuplicates – if set to false no more than one item of every url + message combination will be queued.
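The url + message deduplication described for allowDuplicates can be pictured with the sketch below. It is an illustration under assumptions rather than the library's code: dedupeKey and createQueue are hypothetical names, and comparing messages through their JSON encoding is an assumed strategy.

```javascript
// Hypothetical helper: one key per url + message combination.
// Note that JSON.stringify is sensitive to property order, so
// { a: 1, b: 2 } and { b: 2, a: 1 } produce different keys here.
function dedupeKey(url, message) {
  return JSON.stringify([url, message === undefined ? null : message]);
}

// Hypothetical in-memory queue that optionally rejects duplicates.
function createQueue(allowDuplicates) {
  const items = [];
  const seen = new Set();

  return {
    add(url, message) {
      const key = dedupeKey(url, message);
      if (!allowDuplicates && seen.has(key)) {
        return false; // this url + message combination is already queued
      }
      seen.add(key);
      items.push({ url, message });
      return true;
    },
    size() {
      return items.length;
    },
  };
}

// Usage: the second add is rejected, the third differs in its message
const queue = createQueue(false);
const addedFirst = queue.add('http://example.org/', { depth: 1 });
const addedAgain = queue.add('http://example.org/', { depth: 1 });
const addedOther = queue.add('http://example.org/', { depth: 2 });
```

Because the message is part of the key, the same URL can still be queued several times as long as each entry carries a different message.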
The simplest of simple implementations for keeping track of throttled hosts and queued URL:s. Handles it all in-memory. Same interface can be used to build a database backend for this though.
A PostgreSQL + Knex-driven lookup that throttles hosts and queues URL:s using database tables.
Use it by setting up the tables in pglookup.sql and include it by setting the FetchPolitely options to:
{
  lookup: FetchPolitely.PolitePGLookup,
  lookupOptions: {
    knex: knexInstance
  }
}
Pull Requests are welcome if someone wants to pull out the Knex dependency. Most projects where this has been used with Postgres have been using Knex, so it got used here as well.
- knex – required – the database connection to use, provided through a Knex object.
- purgeWindow – the minimum interval in milliseconds between two host purges. Defaults to 500 ms.
- concurrentReleases – how many parallel database lookups to perform to check for released URL:s. Defaults to 2.
- releasesPerBatch – how many URL:s to fetch in each database lookup. Defaults to 5.
- onlyDeduplicateMessages – a boolean that, if set, will only deduplicate URL:s with the same message when deduplicating. Defaults to false.
npm test