diff --git a/README.md b/README.md
index 27a8bebc..696059f7 100644
--- a/README.md
+++ b/README.md
@@ -31,129 +31,117 @@ var options = {
   directory: '/path/to/save/',
 };
 
-// with callback
-scrape(options, function (error, result) {
+// with promise
+scrape(options).then((result) => {
+  /* some code here */
+}).catch((err) => {
   /* some code here */
 });
 
-// or with promise
-scrape(options).then(function (result) {
+// or with callback
+scrape(options, (error, result) => {
   /* some code here */
 });
 ```
 
-## API
-### scrape(options, callback)
-Makes requests to `urls` and saves all files found with `sources` to `directory`.
-
-**options** - object containing next options:
-
- - `urls`: array of urls to load and filenames for them *(required, see example below)*
- - `directory`: path to save loaded files *(required)*
- - `sources`: array of objects to load, specifies selectors and attribute values to select files for loading *(optional, see example below)*
- - `recursive`: boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading *(optional, see example below)*
- - `maxDepth`: positive number, maximum allowed depth for dependencies *(optional, see example below)*
- - `request`: object, custom options for [request](https://github.com/request/request#requestoptions-callback) *(optional, see example below)*
- - `subdirectories`: array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory` *(optional, see example below)*
- - `defaultFilename`: filename for index page *(optional, default: 'index.html')*
- - `prettifyUrls`: whether urls should be 'prettified', by having the `defaultFilename` removed *(optional, default: false)*
- - `ignoreErrors`: boolean, if `true` scraper will continue downloading resources after error occured, if `false` - scraper will finish process and return error *(optional, default: true)*
- - `urlFilter`: function which is called for each url to check whether it should be scraped. *(optional, see example below)*
- - `filenameGenerator`: name of one of the bundled filenameGenerators, or a custom filenameGenerator function *(optional, default: 'byType')*
- - `httpResponseHandler`: function which is called on each response, allows to customize resource or reject its downloading *(optional, see example below)*
+## options
+* [urls](#urls) - urls to download, *required*
+* [directory](#directory) - path to save files, *required*
+* [sources](#sources) - select which resources should be downloaded
+* [recursive](#recursive) - follow anchors in html files
+* [maxDepth](#maxdepth) - maximum depth for dependencies
+* [request](#request) - custom options for [request](https://github.com/request/request)
+* [subdirectories](#subdirectories) - subdirectories for file extensions
+* [defaultFilename](#defaultfilename) - filename for index page
+* [prettifyUrls](#prettifyurls) - prettify urls
+* [ignoreErrors](#ignoreerrors) - whether to ignore errors on resource downloading
+* [urlFilter](#urlfilter) - skip some urls
+* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
+* [httpResponseHandler](#httpresponsehandler) - customize http response handling
 
-Default options you can find in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).
+You can find the default options in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).
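+
+For example, a call that passes only the two required options relies entirely on those defaults - `index.html` as the filename for the index page, no recursion, and all resources downloaded:
+
+```javascript
+var scrape = require('website-scraper');
+
+// Only `urls` and `directory` are set - everything else uses the defaults
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save'
+}).then(console.log).catch(console.log);
+```
+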
-
-**callback** - callback function *(optional)*, includes following parameters:
-
- - `error`: if error - `Error` object, if success - `null`
- - `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
-   - `url`: url of loaded page
-   - `filename`: filename where page was saved (relative to `directory`)
-   - `children`: array of children Resources
-
-### Filename Generators
-The filename generator determines where the scraped files are saved.
-
-#### byType (default)
-When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting)
-or directly in the `directory` folder, if no subdirectory is specified for the specific type.
-
-#### bySiteStructure
-When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using same structure as on the website:
-- `/` => `DIRECTORY/index.html`
-- `/about` => `DIRECTORY/about/index.html`
-- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
-
-### Http Response Handlers
-HttpResponseHandler is used to reject resource downloading or customize resource text based on response data (for example, status code, content type, etc.)
-Function takes `response` argument - response object of [request](https://github.com/request/request) module and should return resolved `Promise` if resource should be downloaded or rejected with Error `Promise` if it should be skipped.
-Promise should be resolved with:
-* `string` which contains response body
-* or object with properies `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
-
-See [example of using httpResponseHandler](#example-5-rejecting-resources-with-404-status-and-adding-metadata).
-
-## Examples
-#### Example 1
-Let's scrape some pages from [http://nodejs.org/](http://nodejs.org/) with images, css, js files and save them to `/path/to/save/`.
-Imagine we want to load:
- - [Home page](http://nodejs.org/) to `index.html`
- - [About page](http://nodejs.org/about/) to `about.html`
- - [Blog](http://blog.nodejs.org/) to `blog.html`
-
-and separate files into directories:
-
- - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
- - `js` for .js (full path `/path/to/save/js`)
- - `css` for .css (full path `/path/to/save/css`)
-
+#### urls
+Array of urls to download. Each item is either a url string (saved with the `defaultFilename`) or an object with `url` and `filename` properties. **_Required_**.
 ```javascript
-var scrape = require('website-scraper');
 scrape({
   urls: [
     'http://nodejs.org/', // Will be saved with default filename 'index.html'
     {url: 'http://nodejs.org/about', filename: 'about.html'},
     {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
   ],
+  directory: '/path/to/save'
+}).then(console.log).catch(console.log);
+```
+
+#### directory
+String, absolute path to the directory where downloaded files will be saved. The directory should not exist; it will be created by the scraper. **_Required_**.
+
+#### sources
+Array of objects that specify selectors and attribute values used to select files for downloading. By default the scraper tries to download all possible resources.
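+Each source is a `{selector, attr}` pair, so anything a CSS selector can reach can be targeted. For example (a sketch; the favicon selector here is just an illustration), the download could be narrowed to favicons only:
+```javascript
+// Download only favicons
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save',
+  sources: [
+    {selector: 'link[rel="icon"]', attr: 'href'}
+  ]
+}).then(console.log).catch(console.log);
+```
+A more typical configuration downloads images, stylesheets and scripts: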
+```javascript
+// Downloading images, css files and scripts
+scrape({
+  urls: ['http://nodejs.org/'],
   directory: '/path/to/save',
-  subdirectories: [
-    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
-    {directory: 'js', extensions: ['.js']},
-    {directory: 'css', extensions: ['.css']}
-  ],
   sources: [
     {selector: 'img', attr: 'src'},
     {selector: 'link[rel="stylesheet"]', attr: 'href'},
     {selector: 'script', attr: 'src'}
-  ],
+  ]
+}).then(console.log).catch(console.log);
+```
+
+#### recursive
+Boolean, if `true` scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading. Defaults to `false`.
+
+#### maxDepth
+Positive number, maximum allowed depth for dependencies. Defaults to `null` - no maximum depth set.
+
+#### request
+Object, custom options for [request](https://github.com/request/request#requestoptions-callback). Allows setting cookies, userAgent, etc.
+```javascript
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save',
   request: {
     headers: {
       'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'
     }
   }
-}).then(function (result) {
-  console.log(result);
-}).catch(function(err){
-  console.log(err);
-});
+}).then(console.log).catch(console.log);
 ```
 
-#### Example 2. Recursive downloading
+#### subdirectories
+Array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory`.
 ```javascript
-// Links from example.com will be followed
-// Links from links will be ignored because theirs depth = 2 is greater than maxDepth
-var scrape = require('website-scraper');
+/* Separate files into directories:
+  - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
+  - `js` for .js (full path `/path/to/save/js`)
+  - `css` for .css (full path `/path/to/save/css`)
+*/
 scrape({
-  urls: ['http://example.com/'],
+  urls: ['http://example.com'],
   directory: '/path/to/save',
-  recursive: true,
-  maxDepth: 1
+  subdirectories: [
+    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
+    {directory: 'js', extensions: ['.js']},
+    {directory: 'css', extensions: ['.css']}
+  ]
 }).then(console.log).catch(console.log);
 ```
 
-#### Example 3. Filtering out external resources
+#### defaultFilename
+String, filename for index page. Defaults to `index.html`.
+
+#### prettifyUrls
+Boolean, whether urls should be 'prettified' by having the `defaultFilename` removed. Defaults to `false`.
+
+#### ignoreErrors
+Boolean, if `true` the scraper will continue downloading resources after an error occurs; if `false` the scraper will stop the process and return the error. Defaults to `true`.
+
+#### urlFilter
+Function which is called for each url to check whether it should be scraped. Defaults to `null` - no url filter will be applied.
 ```javascript
 // Links to other websites are filtered out by the urlFilter
 var scrape = require('website-scraper');
@@ -166,28 +154,40 @@ scrape({
 }).then(console.log).catch(console.log);
 ```
 
-#### Example 4. Downloading an entire website
+#### filenameGenerator
+String, name of one of the bundled filenameGenerators, or a custom filenameGenerator function. The filename generator determines where the scraped files are saved.
+
+###### byType (default)
+When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting) or directly in the `directory` folder if no subdirectory is specified for that type.
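+
+For example (a sketch reusing the `subdirectories` setting described above), `.js` files end up in `/path/to/save/js`, while files with no matching subdirectory go directly to `/path/to/save`:
+```javascript
+// `byType` is the default, it is named here only for clarity
+scrape({
+  urls: ['http://example.com/'],
+  directory: '/path/to/save',
+  filenameGenerator: 'byType',
+  subdirectories: [
+    {directory: 'js', extensions: ['.js']}
+  ]
+}).then(console.log).catch(console.log);
+```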
+
+###### bySiteStructure
+When the `bySiteStructure` filenameGenerator is used the downloaded files are saved in `directory` using the same structure as on the website:
+- `/` => `DIRECTORY/index.html`
+- `/about` => `DIRECTORY/about/index.html`
+- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`
+
 ```javascript
-// Downloads all the crawlable files of example.com.
-// The files are saved in the same structure as the structure of the website, by using the `bySiteStructure` filenameGenerator.
+// Downloads all the crawlable files. The files are saved in the same structure as on the website
 // Links to other websites are filtered out by the urlFilter
 var scrape = require('website-scraper');
 scrape({
   urls: ['http://example.com/'],
-  urlFilter: function(url){
-    return url.indexOf('http://example.com') === 0;
-  },
+  urlFilter: function(url){ return url.indexOf('http://example.com') === 0; },
   recursive: true,
   maxDepth: 100,
-  prettifyUrls: true,
   filenameGenerator: 'bySiteStructure',
   directory: '/path/to/save'
 }).then(console.log).catch(console.log);
 ```
 
-#### Example 5. Rejecting resources with 404 status and adding metadata
+#### httpResponseHandler
+Function which is called on each response; allows customizing the resource or rejecting its download.
+It takes one argument - the response object of the [request](https://github.com/request/request) module - and should return a `Promise` which is resolved if the resource should be downloaded, or rejected with an `Error` if it should be skipped.
+The promise should be resolved with:
+* a `string` which contains the response body,
+* or an object with properties `body` (response body, string) and `metadata` - everything you want to save for this resource (headers, original text, timestamps, etc.). The scraper does not use this field at all; it is only passed through to the result.
 ```javascript
-var scrape = require('website-scraper');
+// Rejecting resources with 404 status and adding metadata to other resources
 scrape({
   urls: ['http://example.com/'],
   directory: '/path/to/save',
@@ -207,6 +207,15 @@ scrape({
 }).then(console.log).catch(console.log);
 ```
+The scrape function resolves with an array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects which contain the `metadata` property from `httpResponseHandler`.
+
+## callback
+Callback function, optional, with the following parameters:
+ - `error`: if error - `Error` object, if success - `null`
+ - `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
+   - `url`: url of loaded page
+   - `filename`: filename where page was saved (relative to `directory`)
+   - `children`: array of children Resources
 
 ## Log and debug
-This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs you should use environment variable `DEBUG`.
+This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs, set the `DEBUG` environment variable.
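+
+For example (a sketch - the `website-scraper` namespace matches the module name and the script name is a placeholder; adjust both to your setup):
+
+```bash
+DEBUG=website-scraper* node your-script.js
+```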