```javascript
var scrape = require('website-scraper');

var options = {
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save/',
};

// with promise
scrape(options).then((result) => {
  /* some code here */
}).catch((err) => {
  /* some code here */
});

// or with callback
scrape(options, (error, result) => {
  /* some code here */
});
```

Makes requests to `urls` and saves all files found with `sources` to `directory`.

## options
* [urls](#urls) - urls to download, *required*
* [directory](#directory) - path to save files, *required*
* [sources](#sources) - selects which resources should be downloaded
* [recursive](#recursive) - follow anchors in html files
* [maxDepth](#maxdepth) - maximum depth for dependencies
* [request](#request) - custom options for [request](https://github.com/request/request)
* [subdirectories](#subdirectories) - subdirectories for file extensions
* [defaultFilename](#defaultfilename) - filename for index page
* [prettifyUrls](#prettifyurls) - prettify urls
* [ignoreErrors](#ignoreerrors) - whether to ignore errors on resource downloading
* [urlFilter](#urlfilter) - skip some urls
* [filenameGenerator](#filenamegenerator) - generate filename for downloaded resource
* [httpResponseHandler](#httpresponsehandler) - customize http response handling

You can find the default options in [lib/config/defaults.js](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/config/defaults.js).


#### urls
Array of urls to download; each item is either a url string or an object with the url and the filename to save it under. **_Required_**.
```javascript
var scrape = require('website-scraper');
scrape({
  urls: [
    'http://nodejs.org/', // Will be saved with default filename 'index.html'
    {url: 'http://nodejs.org/about', filename: 'about.html'},
    {url: 'http://blog.nodejs.org/', filename: 'blog.html'}
  ],
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```

#### directory
String, absolute path to the directory where downloaded files will be saved. The directory should not exist; it will be created by the scraper. **_Required_**.

#### sources
Array of objects specifying the selectors and attribute values used to select files for downloading. By default the scraper tries to download all possible resources.
```javascript
// Downloading images, css files and scripts
scrape({
  urls: ['http://nodejs.org/'],
  directory: '/path/to/save',
  sources: [
    {selector: 'img', attr: 'src'},
    {selector: 'link[rel="stylesheet"]', attr: 'href'},
    {selector: 'script', attr: 'src'}
  ]
}).then(console.log).catch(console.log);
```

#### recursive
Boolean; if `true`, the scraper will follow anchors in html files. Don't forget to set `maxDepth` to avoid infinite downloading (see the example under `maxDepth` below). Defaults to `false`.

#### maxDepth
Positive number, maximum allowed depth for dependencies. Defaults to `null` - no maximum depth set.
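
For example, to follow the links found on the start page but go no further (links found on the followed pages have depth 2, which exceeds `maxDepth`):
```javascript
var scrape = require('website-scraper');

// Links from example.com will be followed;
// links found on those pages are ignored because their depth = 2 is greater than maxDepth
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);
```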

#### request
Object, custom options for [request](https://github.com/request/request#requestoptions-callback). Allows you to set cookies, userAgent, etc.
```javascript
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'
    }
  }
}).then(console.log).catch(console.log);
```

#### subdirectories
Array of objects, specifies subdirectories for file extensions. If `null` all files will be saved to `directory`.
```javascript
var scrape = require('website-scraper');

/* Separate files into directories:
   - `img` for .jpg, .png, .svg (full path `/path/to/save/img`)
   - `js` for .js (full path `/path/to/save/js`)
   - `css` for .css (full path `/path/to/save/css`)
*/
scrape({
  urls: ['http://example.com'],
  directory: '/path/to/save',
  subdirectories: [
    {directory: 'img', extensions: ['.jpg', '.png', '.svg']},
    {directory: 'js', extensions: ['.js']},
    {directory: 'css', extensions: ['.css']}
  ]
}).then(console.log).catch(console.log);
```

#### defaultFilename
String, filename for index page. Defaults to `index.html`.
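
A minimal sketch; the filename `main.html` is just an illustrative value:
```javascript
// Save the index page of each url as 'main.html' instead of 'index.html'
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  defaultFilename: 'main.html'
}).then(console.log).catch(console.log);
```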

#### prettifyUrls
Boolean, whether urls should be 'prettified', by having the `defaultFilename` removed. Defaults to `false`.
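
A minimal sketch; with `prettifyUrls` enabled, a link that would otherwise point to `about/index.html` is written as `about/` instead (assuming `defaultFilename` is left at its default):
```javascript
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  prettifyUrls: true
}).then(console.log).catch(console.log);
```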

#### ignoreErrors
Boolean; if `true`, the scraper continues downloading resources after an error occurs; if `false`, the scraper stops and returns the error. Defaults to `true`.
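
For example, to stop on the first failed resource instead of skipping it:
```javascript
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  ignoreErrors: false // the returned promise rejects with the first download error
}).then(console.log).catch((err) => console.error(err));
```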

#### urlFilter
Function which is called for each url to check whether it should be scraped. Defaults to `null` - no url filter will be applied.
```javascript
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
  urlFilter: function (url) {
    return url.indexOf('http://example.com') === 0;
  },
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```

#### filenameGenerator
String, the name of one of the bundled filenameGenerators, or a custom filenameGenerator function. The filename generator determines where the scraped files are saved.

###### byType (default)
When the `byType` filenameGenerator is used the downloaded files are saved by type (as defined by the `subdirectories` setting) or directly in the `directory` folder, if no subdirectory is specified for the specific type.

###### bySiteStructure
When the `bySiteStructure` filenameGenerator is used, the downloaded files are saved in `directory` using the same structure as on the website:
- `/` => `DIRECTORY/index.html`
- `/about` => `DIRECTORY/about/index.html`
- `/resources/javascript/libraries/jquery.min.js` => `DIRECTORY/resources/javascript/libraries/jquery.min.js`

```javascript
// Downloads all the crawlable files. The files are saved in the same structure as the structure of the website
// Links to other websites are filtered out by the urlFilter
var scrape = require('website-scraper');
scrape({
  urls: ['http://example.com/'],
  urlFilter: function (url) { return url.indexOf('http://example.com') === 0; },
  recursive: true,
  maxDepth: 100,
  prettifyUrls: true,
  filenameGenerator: 'bySiteStructure',
  directory: '/path/to/save'
}).then(console.log).catch(console.log);
```

#### httpResponseHandler
Function which is called on each response; allows you to customize the resource or reject its download.
It takes one argument - the response object of the [request](https://github.com/request/request) module - and should return a `Promise` which resolves if the resource should be downloaded, or rejects with an `Error` if it should be skipped.
The promise should be resolved with:
* a `string` containing the response body
* or an object with properties `body` (response body, string) and `metadata` - everything you want to save for this resource (like headers, original text, timestamps, etc.); the scraper will not use this field at all, it is only returned in the result.
```javascript
var scrape = require('website-scraper');
// Rejecting resources with 404 status and adding metadata to other resources
scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  httpResponseHandler: (response) => {
    if (response.statusCode === 404) {
      // reject to skip this resource
      return Promise.reject(new Error('status is 404'));
    }
    // resolve with an object to attach metadata to the resource in the result
    return Promise.resolve({
      body: response.body,
      metadata: {
        headers: response.headers
      }
    });
  }
}).then(console.log).catch(console.log);
```
The scrape function resolves with an array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects which contain the `metadata` property from `httpResponseHandler`.
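
A sketch of reading that metadata from the result (assuming the handler above, which stores the response headers):
```javascript
scrape(options).then((result) => {
  result.forEach((resource) => {
    // `metadata` is whatever the httpResponseHandler resolved with for this resource
    console.log(resource.url, resource.metadata.headers);
  });
});
```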

## callback
Optional callback function with the following parameters:
- `error`: if error - `Error` object, if success - `null`
- `result`: if error - `null`, if success - array of [Resource](https://github.com/s0ph1e/node-website-scraper/blob/master/lib/resource.js) objects containing:
- `url`: url of loaded page
- `filename`: filename where page was saved (relative to `directory`)
- `children`: array of children Resources

## Log and debug
This module uses [debug](https://github.com/visionmedia/debug) to log events. To enable logs, use the environment variable `DEBUG`.