Fluent Phantom

A fluent interface for scraping web content in Node with the PhantomJS headless browser. Its most notable feature is that, similar to waitFor.js, it can wait until content is available on a page before scraping, which comes in handy when the content is generated by AJAX requests.

Installation

Install via npm with:

npm install fluent-phantom

Note that this package depends on the PhantomJS bridge for Node, which assumes that you have already installed PhantomJS.

Basic Example

The most likely scenario for usage – scraping content dynamically generated by the client using a CSS selector – is outlined in CoffeeScript below. For more information, read the rest of this document.

phantom = require 'fluent-phantom'
phantom.create()
	.select('#headlines h3')
	.with().members('innerText')
	.from('http://example.com/news')
	.and().process((elements) -> 
		for element in elements
			new Headline(element.innerText).save()
	)
	.until(5000) # Timeout after 5 seconds
	.otherwise(-> 
		throw Error "Headlines were never loaded"
	)
	.execute()

Overview

The core of this package is two classes: a Request class that wraps the PhantomJS bridge and a fluent Builder to simplify its use.

Connection strategies
Builder
Request
- Setters and getters
- Debugging
- Events

Connection Strategies

By default, a new Phantom object is created for each request, but this can be wasteful. In v1.1.0, alternative connection strategies were introduced.

Those strategies are:

NewPhantom: Creates a new Phantom for each request. [default]
NewPhantomAndPort: Creates a new Phantom for each request with a new port. This ensures no EADDRINUSE conflicts, but could create an unbounded number of phantoms and so is discouraged. If used, consider automatically closing connections when finished using Request.closeWhenFinished(true).
RecycledPhantom: Recycles the same Phantom connection for each request. This is the simplest way to avoid an EADDRINUSE bug if making many requests.
RoundRobin(poolSize, [minimum], [queueDepth]): Accepts poolSize as the number of PhantomJS objects in the pool, minimum as the number of PhantomJS objects to initialize beforehand, and queueDepth as the number of enqueued functions allowed to wait for a ready connection.
RandomPool(poolSize, [queueDepth]): Accepts the pool size and queueDepth as constructor arguments with the same meaning as RoundRobin and selects a connection at random.
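
To make the pooling idea concrete, here is a minimal sketch of round-robin selection in plain JavaScript. It is not the package's implementation: RoundRobinPool, createConnection, and next are hypothetical names, connections beyond the initial minimum are assumed to be created lazily, and the queueDepth behaviour is left out entirely.

```javascript
// Hypothetical sketch of round-robin selection; not fluent-phantom's own code.
class RoundRobinPool {
  // createConnection is an assumed factory standing in for "start a PhantomJS instance".
  constructor(poolSize, minimum, createConnection) {
    this.poolSize = poolSize;
    this.createConnection = createConnection;
    this.connections = [];
    this.cursor = 0;
    // Eagerly create `minimum` connections, never more than the pool allows.
    for (let i = 0; i < Math.min(minimum, poolSize); i++) {
      this.connections.push(createConnection(i));
    }
  }

  next() {
    // Grow lazily until the pool is full, then cycle through existing connections.
    if (this.cursor >= this.connections.length && this.connections.length < this.poolSize) {
      this.connections.push(this.createConnection(this.connections.length));
    }
    const connection = this.connections[this.cursor];
    this.cursor = (this.cursor + 1) % this.poolSize;
    return connection;
  }
}

// Plain objects stand in for PhantomJS connections.
const pool = new RoundRobinPool(3, 1, (id) => ({ id }));
const order = [pool.next(), pool.next(), pool.next(), pool.next()].map((c) => c.id);
```

After four requests the fourth one reuses the first connection, which is the property that keeps the number of PhantomJS processes bounded.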

Changing the connection strategy

A package-level method, fluent-phantom.setConnectionStrategy(strategy), can be invoked with a new instance of one of the connection strategies above to change the strategy.

fluent = require 'fluent-phantom'
fluent.setConnectionStrategy new fluent.ConnectionStrategy.RoundRobin(10, 2)

Builder

Initializing a builder with `create()`

Builders can be created with new, but using the package-level create() helper is encouraged, as in the following example:

phantom = require 'fluent-phantom'
builder = phantom.create()

Declaring a URL with `from()`

Builder.from(url: string)

Synonyms: url()

Sets the request URL.

phantom.create().from('http://example.com/')

Scraping content with CSS selectors

The easiest way to scrape content is to call select() with a CSS selector and let the library generate your scraping code for you. This will automatically wait until the number of elements found with this selector is greater than zero by immediately invoking when(). Optionally, the members of nodes in the result set can be limited using properties().

Here is an example of extracting all h3 elements in a container with the id 'headlines':

phantom.create()
	.select('#headlines h3')
	.from('http://example.com/news')
	.and().process((elements) -> console.log "Handle results here", elements)
	.with().members('innerHTML', 'className', 'id')
	.until(5000) # Timeout after 5 seconds
	.otherwise(-> console.error "The selector was never satisfied; handle errors here")
	.execute()

Builder.select(selector: string|array|object, [minimum: number])

Synonyms: extract()

See also: when(), properties(), handle()

When select is invoked with a string, it is assumed to be a CSS selector describing the elements to be scraped. When used in this way, select will immediately invoke when to automatically wait for the selector to be satisfied. If a number is passed as the second argument, it is treated as the minimum number of elements that must match and is passed on to when.

Note that WebKit's document.querySelectorAll() is used to interpret selectors.

If an array or map of strings is provided, it is assumed that multiple selections are desired, and an array or map of the corresponding results will be returned. Index ordering of arrays will be preserved.
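
The three accepted shapes can be pictured with a small sketch in plain JavaScript. The helper selectAll and the fake query function are hypothetical and exist only for illustration; in particular, the exact shape of each result inside the real library is an assumption here.

```javascript
// Hypothetical helper: `query` stands in for document.querySelectorAll inside the page.
function selectAll(selectors, query) {
  if (typeof selectors === 'string') return query(selectors);
  // Arrays map position by position, so index order is preserved.
  if (Array.isArray(selectors)) return selectors.map((s) => query(s));
  // Maps keep their keys and replace each selector with its results.
  const results = {};
  for (const key of Object.keys(selectors)) results[key] = query(selectors[key]);
  return results;
}

// A fake page used only for illustration.
const fakePage = { h1: ['Title'], h3: ['First', 'Second'] };
const fakeQuery = (selector) => fakePage[selector] || [];

const asArray = selectAll(['h3', 'h1'], fakeQuery);
const asMap = selectAll({ title: 'h1', headlines: 'h3' }, fakeQuery);
```

Here asArray holds the h3 results at index 0 and the h1 results at index 1, while asMap exposes the same results under the keys title and headlines.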

Builder.handle(callback: function)

Synonyms: process(), receive()

Specify a callback function to process results. Because scraping performed by select happens inside the PhantomJS VM, the results must be serialized to JSON and processed by a function in the Node.js VM where this module is being used. This is where you would publish results to a message queue or persist them to a data store.

The callback will receive two arguments. The first will be an array of results – even if there is only one result in the set. The second will be a reference to the page object.
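
The JSON round trip explains why only plain data reaches the callback. The snippet below uses nothing but the standard JSON functions; the element object is a made-up stand-in for a scraped node, not something produced by this package.

```javascript
// A made-up stand-in for a node scraped inside the PhantomJS VM.
const element = {
  id: 'lead-story',
  innerText: 'Breaking news',
  focus() {},               // functions are dropped by JSON.stringify
  ownerDocument: undefined, // undefined values are dropped as well
};

// Simulate crossing the VM boundary: serialize on one side, parse on the other.
const received = JSON.parse(JSON.stringify(element));
const receivedKeys = Object.keys(received);
```

Only id and innerText survive the trip, so anything the callback needs must be available as a plain, serializable property.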

Builder.properties(property: string, [property2: string], [...]) or Builder.properties(properties: array)

Synonyms: members()

By default, elements in the result set are stripped down to a handful of properties: id, children, className, innerHTML, and a few others. This list can be expanded or contracted by invoking the properties method with a list of property names as strings. They should be valid DOM properties.
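
As a rough picture of what limiting the properties means, the sketch below copies only the named members of a node into a fresh object. pickProperties is a hypothetical helper written for this illustration, not a function exported by the package.

```javascript
// Hypothetical helper, not part of fluent-phantom: copy only the named properties.
function pickProperties(node, properties) {
  const copy = {};
  for (const name of properties) {
    // Skip names the node does not have, so typos do not create empty keys.
    if (name in node) copy[name] = node[name];
  }
  return copy;
}

// A plain object standing in for a DOM node.
const node = { id: 'a1', className: 'headline', innerHTML: '<b>Hi</b>', parentNode: {} };
const slim = pickProperties(node, ['id', 'innerHTML', 'noSuchProperty']);
```

The resulting object contains just id and innerHTML, which keeps the payload sent back to Node small.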

Scraping content with functions

Sometimes, a simple CSS selector is not enough to describe the content you want to scrape. The Builder provides several alternatives that allow greater expressiveness.

Builder.select(extractor: function, [argument: any])

Synonyms: extract()

Builder.evaluate(extractor: function, handler: function, [argument: any])

Builder.run(function)

Synonyms: invoke()

For those who already know what they're doing with PhantomJS and only want the convenience of using this module to wait for content to be available, this method accepts a function that receives a PhantomJS page object as its only argument for use with, for instance, page.evaluate. This method is mutually exclusive with select and evaluate.

Waiting for content

With CSS selectors

The two forms of when cause any actions specified by handle, evaluate, or run to be suspended until a sentry condition is satisfied or the timeout period set with timeout() has elapsed. The author's original motivation for using PhantomJS over HTMLUnit or Beautiful Soup was to scrape content generated client-side via a long-polling mechanism. The when method is ideal for this because it delays scraping until the content to be scraped is available.

When a sentry condition has been set, the Request will test for it immediately and every ~250ms afterwards until it has been satisfied or the test times out. At each check, the Request will emit the 'checking' event.
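
The check-immediately-then-poll behaviour can be sketched in a few lines of plain JavaScript. This is an illustration under assumptions, not the Request implementation: waitFor and its option names are hypothetical, elapsed time is approximated by counting intervals, and the onCheck callback merely plays the role of the 'checking' event.

```javascript
// Hypothetical sketch of the polling loop; not the package's Request implementation.
// `schedule(fn, delay)` defaults to setTimeout and is injectable so the loop can be
// driven by a fake clock. `timeout` is in milliseconds and must be supplied.
function waitFor({ test, onReady, onTimeout, onCheck = () => {}, timeout, interval = 250,
                   schedule = setTimeout }) {
  let elapsed = 0; // approximated by counting intervals, not by reading a real clock
  function check() {
    onCheck(elapsed);                                     // analogous to the 'checking' event
    if (test()) return onReady();                         // sentry satisfied: stop polling
    if (elapsed + interval > timeout) return onTimeout(); // no further check fits in the budget
    elapsed += interval;
    schedule(check, interval);
  }
  check(); // the first test happens immediately
}

// Drive the loop with a fake scheduler that runs callbacks synchronously.
const checks = [];
let polls = 0;
let outcome = null;
waitFor({
  test: () => ++polls >= 3, // the "content" appears on the third check
  onReady: () => { outcome = 'ready'; },
  onTimeout: () => { outcome = 'timeout'; },
  onCheck: (ms) => checks.push(ms),
  timeout: 1000,
  schedule: (fn) => fn(),
});
```

With the default scheduler the same function polls in real time. Injecting the scheduler is a design choice made purely so the sketch can be exercised without waiting.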

Builder.when(selector: string, [minimum: number])

With functions

Builder.when(sentry: function, [argument: any])

Controlling timeouts and errors

Builders provide support for setting bounds on the time to wait for content to become ready.

Builder.otherwise(callback: function)

Set a callback to be invoked when the request times out while waiting for content to become available. The callback will receive no arguments.

Builder.timeout(milliseconds: number)

Synonyms: for(), until()

Set the duration to wait before timing out. Argument should be a number of milliseconds as an integer, a string that can be parsed by Duration (e.g. "5s"), or a Duration object.
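
To illustrate how the accepted argument forms can collapse into one number, here is a hedged sketch in plain JavaScript. toMilliseconds and its unit table are hypothetical; the package delegates string parsing to Duration, whose grammar may well be broader than what is shown, and Duration objects are not handled here.

```javascript
// Hypothetical normalizer; the package delegates string parsing to Duration,
// whose accepted grammar may be broader than this sketch.
const UNIT_MS = { ms: 1, s: 1000, m: 60000, h: 3600000 };

function toMilliseconds(value) {
  // Numbers are taken to be milliseconds already.
  if (typeof value === 'number') return Math.round(value);
  // Strings are assumed to look like "<amount><unit>", for example "5s" or "250ms".
  const match = /^(\d+(?:\.\d+)?)\s*(ms|s|m|h)$/.exec(String(value).trim());
  if (!match) throw new TypeError('Unrecognized duration: ' + value);
  return Math.round(parseFloat(match[1]) * UNIT_MS[match[2]]);
}

const fromNumber = toMilliseconds(5000);
const fromString = toMilliseconds('5s');
```

Both calls describe the same five-second timeout, one as a number and one as a string.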

Builder.forever()

Allow a request to wait forever for content.

Builder.immediately()

Causes the request to test its condition exactly once, with no further polling.

Filler terms

For expressiveness, several meaningless terms exist in the builder grammar. They are:

and()
then()
of()
with()

These terms provide a more fluent feel. For instance, compare the following two examples:

# Vanilla
phantom.create()
	.select('div.feature a')
	.process(-> ...)

# Slightly more expressive
phantom.create()
	.select('div.feature a')
	.and().then().process(-> ...)

The difference is small, but available if desired.
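
One plausible way such filler terms work is that each simply returns the builder it was called on. The miniature below, in plain JavaScript, shows the idea; MiniBuilder is hypothetical and its internals are assumptions, with only the method names borrowed from this document.

```javascript
// Hypothetical miniature builder; method names mirror this document, internals are assumed.
class MiniBuilder {
  constructor() { this.config = {}; }
  select(selector) { this.config.selector = selector; return this; }
  process(callback) { this.config.handler = callback; return this; }
}

// Filler terms change nothing; they only hand the builder back for further chaining.
for (const filler of ['and', 'then', 'of', 'with']) {
  MiniBuilder.prototype[filler] = function () { return this; };
}

const handler = () => {};
const plain = new MiniBuilder().select('div.feature a').process(handler);
const fluent = new MiniBuilder().select('div.feature a').and().then().process(handler);
```

Both chains end with the same configuration, which is why the filler terms are safe to add or omit.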

Finishing with `build()` and `execute()`

Once a builder has received all input to create a request, you can build it using build() or immediately execute (and return) it with execute().

Builder.build()

Builds a Request object to your specifications.

Builder.execute()

Builds and immediately executes a new Request object. The Request is returned.

Request

Documentation coming soon. In the meantime, view the annotated source in its raw form or prettied up by docco.
