NodeJS Harvester

An easily extended Node.js data harvester built on the cheerio library.

Given a series of URLs and customizable DOM selectors specified in a CSV file, this application parses the URLs and writes the generated output to a customizable database backend. The database backend is then served as JSON via a RESTful route using the ExpressJS framework.

While parsing the URLs, any documents, media, or images that are found are downloaded to an appropriate media entity folder, and the HTML markup served is altered to link to the media entity with a customized UUID tag rather than an href tag. This approach helps to ensure a good data model when migrating content into the many different types of content management systems.
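The link rewriting described above could look roughly like the sketch below. It is an illustration only, not the harvester's code: the function name `relinkMedia`, the `data-uuid` attribute, and the extension list are all assumptions, and a regular expression stands in for the DOM traversal the harvester does with cheerio. Downloading the files is left out.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical sketch: replace href attributes that point at media files
// with a UUID attribute, and remember which UUID belongs to which URL.
function relinkMedia(html, mediaExtensions = ['pdf', 'doc', 'jpg', 'png']) {
  const uuidByUrl = new Map();
  const pattern = new RegExp(
    `href="([^"]+\\.(?:${mediaExtensions.join('|')}))"`,
    'gi'
  );
  const rewritten = html.replace(pattern, (match, url) => {
    // Reuse the same UUID when the same file is linked more than once.
    if (!uuidByUrl.has(url)) {
      uuidByUrl.set(url, randomUUID());
    }
    // "data-uuid" is an assumed attribute name, not taken from the project.
    return `data-uuid="${uuidByUrl.get(url)}"`;
  });
  return { html: rewritten, media: uuidByUrl };
}

const result = relinkMedia(
  '<a href="files/report.pdf">Report</a> <a href="page.html">Page</a>'
);
console.log(result.html);
console.log(result.media.size); // 1
```

Only the link to the PDF is rewritten; the ordinary page link keeps its href.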

Installation

Install NodeJS via the traditional methods.
Run the following commands from the working directory:

  npm install -g bower
  npm install -g grunt-cli
  npm install

Configure your config.yml file.
The following GruntJS tasks have been registered (grunt-contrib tasks linked below):
- JsHint
- JSCS
- Uglify
- Nodemon

  grunt harvest      // Expanded -> ['jshint', 'jscs', 'uglify', 'nodemon:harvest']
  grunt debug        // Expanded -> ['jshint', 'jscs', 'uglify', 'nodemon:debug']
  grunt export       // Expanded -> ['jshint', 'jscs', 'uglify', 'nodemon:export']
  grunt restore      // Expanded -> ['clean', 'jshint', 'jscs', 'uglify']
  grunt serve        // Expanded -> ['jshint', 'jscs', 'uglify', 'nodemon:serve']
  grunt readconfig
  grunt updateconfig
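The alias tasks above could be registered roughly as follows. This is a minimal JavaScript sketch, not the repository's actual Gruntfile.coffee: plugin loading and per-task configuration are omitted, the helper name `registerAliases` is an assumption, and the only Grunt call used is the standard `grunt.registerTask(name, taskList)`.

```javascript
// Hypothetical sketch of the alias tasks listed above.
const aliases = {
  harvest: ['jshint', 'jscs', 'uglify', 'nodemon:harvest'],
  debug: ['jshint', 'jscs', 'uglify', 'nodemon:debug'],
  export: ['jshint', 'jscs', 'uglify', 'nodemon:export'],
  restore: ['clean', 'jshint', 'jscs', 'uglify'],
  serve: ['jshint', 'jscs', 'uglify', 'nodemon:serve'],
};

// In a real Gruntfile.js this function would be assigned to module.exports
// and Grunt would call it with the grunt object.
function registerAliases(grunt) {
  for (const [name, tasks] of Object.entries(aliases)) {
    // An alias task runs each listed task in order.
    grunt.registerTask(name, tasks);
  }
}
```

Because every alias starts with the lint and minify steps, a failing `jshint` or `jscs` run stops the harvest before any data is fetched.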

Grunt Tasks

Currently there are 6 (+1 debugger) configured grunt tasks to interact with the NodeJS Harvester.

Grunt Harvest: Runs the entire harvest, converting a correctly formatted CSV into a live REST route powered by a custom db layer.
Grunt Restore: Wipes all retrieved data and brings the harvester back to a pristine state.
Grunt Serve: Instantiates the REST routes based on an instantiated database layer (sqlite3 tested) from an earlier harvest.
Grunt Export: Instantiates the REST routes based on an instantiated database layer (sqlite3 tested).
Grunt ReadConfig: Outputs the current YAML configuration file to stdout.
Grunt UpdateConfig: Updates the YAML config file, taking a UUID as a parameter.

How it Works

To demonstrate how this library works, we will use the import.csv file supplied with this repository as an example.

The first three records from the import.csv file are as follows (edited for readability):

id	website	language	pattern	title	body
1	`../index-en.html`	en	wxt3	css::#wb-cont	css::#wb-main-in|h1|#wet-date-mod
2	`../cont-en.html`	en	wxt3	css::#wb-cont	css::#wb-main-in|h1|#wet-date-mod
3	`../grids-en.html`	en	wxt3	css::#wb-cont	css::#wb-main-in|h1|#wet-date-mod

Running grunt harvest on this import.csv file, you will end up with a REST route (among others) that serves that content as JSON (edited for readability):

{
  "rows": [
    {
      "id": 1,
      "website": "http://wet-boew.github.io/wet-boew/index-en.html",
      "language": "en",
      "pattern": "wxt3",
      "title": "Web Experience Toolkit (WET)",
      "body": "\n<!-- MainContentStart -->\n\n\n<section><h2 id=\"about\">What is the Web Experience Toolkit?</h2>...</section></div>\n<!-- MainContentEnd -->\n"
    },
    {
      "id": 2,
      "website": "http://wet-boew.github.io/wet-boew/demos/theme-wet-boew/cont-en.html",
      "language": "en",
      "pattern": "wxt3",
      "title": "Content page - WET theme",
      "body": "\n<!-- MainContentStart -->\n\n\n<section><h2>Heading 2 (<code>h2</code>) - default appearance</h2>...<section></div>\n<!-- MainContentEnd -->\n"
    },
    {
      "id": 3,
      "website": "http://wet-boew.github.io/wet-boew/demos/grids/grid-base-en.html",
      "language": "en",
      "pattern": "wxt3",
      "title": "Grid system",
      "body": "\n<!-- MainContentStart -->\n\n\n<section><div class=\"wet-boew-prettify all-pre linenums\">...<section></div>\n<!-- MainContentEnd -->\n"
    }
  ],
  "rowCount": 3
}

Inspecting the JSON above, you will notice that the CSS selectors from the CSV have been replaced by the actual HTML source they matched. This conversion, and a lot more, happens when the harvester parses the data it receives.
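The first step of that conversion, telling a tokenized cell apart from a plain value, might be sketched as follows. The function name `parseCell` is hypothetical, and treating cells without a `::` separator as literal values is an assumption based on the sample rows, where `en` and `wxt3` pass through unchanged.

```javascript
// Split a CSV cell such as "css::#wb-cont" into its token and value.
// Assumption: cells without a "::" separator are literal values.
function parseCell(cell) {
  const index = cell.indexOf('::');
  if (index === -1) {
    return { token: null, value: cell };
  }
  return { token: cell.slice(0, index), value: cell.slice(index + 2) };
}

// Cells taken from the first record of the sample import.csv.
const row = {
  language: 'en',
  pattern: 'wxt3',
  title: 'css::#wb-cont',
  body: 'css::#wb-main-in|h1|#wet-date-mod',
};

for (const [field, cell] of Object.entries(row)) {
  const { token, value } = parseCell(cell);
  console.log(field, token === null ? 'literal' : token, value);
}
```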

CSS3 DOM Selectors

NodeJS Harvester leverages cheerio, which gives you a fast, flexible, and lean implementation of core jQuery designed for the server.

Cheerio lets you use CSS3 selectors, among many other helper functions, to assist in your mappings.

Given the range of selectors available, a more complicated query could look like the following:

css::body table:nth-child(3) tr td:nth-child(4) form[name~=content]

Tokens

Aside from the css:: token, a range of other tokens can be used in import.csv, each with additional logic attached to it.

The following table lists the current stable tokens along with their intended results.

Token	Description
`css`	Leverages CSS Selectors to extract markup.
`date`	Converts a wide array of date strings into unix time leveraging MomentJS.
`taxonomy`	Support for a list of items separated by commas.
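The two non-css tokens in the table could be handled along the lines of the sketch below. The `handlers` object is hypothetical. The built-in `Date.parse` is swapped in for MomentJS so the example has no dependencies, which means it accepts far fewer date formats, and returning unix time in seconds rather than milliseconds is an assumption.

```javascript
// Hypothetical handlers for the date and taxonomy tokens.
const handlers = {
  // Stand-in for the MomentJS conversion: built-in Date parsing only.
  date(value) {
    const millis = Date.parse(value);
    if (Number.isNaN(millis)) {
      throw new Error(`Unparseable date: ${value}`);
    }
    return Math.floor(millis / 1000); // assumed unit: seconds
  },
  // Split a comma-separated list, dropping surrounding whitespace
  // and empty entries.
  taxonomy(value) {
    return value
      .split(',')
      .map((term) => term.trim())
      .filter((term) => term.length > 0);
  },
};

console.log(handlers.date('1970-01-02T00:00:00Z')); // 86400
console.log(handlers.taxonomy('news, events , media'));
```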

Token Parameters

Currently, the css token described above can have special parameters passed to it.

The following table describes these parameters along with their intended results.

Parameter	Description
`|`	Removes elements matching the selector that follows from the result.
`~`	Allows you to pass a Cheerio operator that the Node.js script will run as a callback.

Based on the parameters above a query could resemble the following:

css::body table .section|h1|parent()

The above query grabs the parent element of the 'body table .section' selector and then removes h1 from the result.
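Splitting the removal parameters out of a css value might look like this sketch, which uses the body column from the sample import.csv as input. The function name `parseCssValue` is hypothetical, and only the `|` separator is handled; running a Cheerio operator as a callback is outside the scope of the example.

```javascript
// Hypothetical sketch: the first segment is the main selector, and every
// segment after a "|" is a selector to remove from the matched markup.
function parseCssValue(value) {
  const [selector, ...remove] = value.split('|').map((part) => part.trim());
  return { selector, remove };
}

const spec = parseCssValue('#wb-main-in|h1|#wet-date-mod');
console.log(spec.selector); // #wb-main-in
console.log(spec.remove);   // [ 'h1', '#wet-date-mod' ]
```

With cheerio, the `remove` list would then typically be applied to the matched element before its HTML is stored.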

Rest Routes

When the import.csv file is processed, data is written to the database according to the config.yml file.

Most of the defined tables inside the sqlite database are passed to Express in order to be rendered as REST routes.

For every defined schema type in the config.yml file the following rest routes will be made (only global ones shown below):

Route	Description
`/[type]`	Renders a limited subset of all imported records to verify that no row was missed.
`/[type]/:id`	Renders the full entity of an individual imported row for every defined csv field.
`/join/[type]`	In Development.
`/join/[type]/:id`	In Development.
`/join/language/[type]`	In Development.
`/all/[type]`	Renders 50 full entities of imported rows for every defined csv field.
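The behaviour of the three implemented routes can be sketched without Express as a plain function over in-memory rows. Everything here is illustrative: the `tables` map, the `page` type, the `resolve` function, and the choice of `id` and `title` as the limited subset are assumptions, while the `{ rows, rowCount }` response shape follows the JSON sample shown earlier.

```javascript
// Hypothetical in-memory stand-in for one harvested table.
const tables = new Map([
  ['page', [
    { id: 1, title: 'Web Experience Toolkit (WET)', body: '<section>...</section>' },
    { id: 2, title: 'Content page - WET theme', body: '<section>...</section>' },
  ]],
]);

// Resolve a request path to a response object, mirroring the route table.
// Returns null for unknown paths and for the routes still in development.
function resolve(path, pageSize = 50) {
  const parts = path.split('/').filter(Boolean);

  // /all/[type]: full entities, capped at pageSize rows.
  if (parts.length === 2 && parts[0] === 'all' && tables.has(parts[1])) {
    const rows = tables.get(parts[1]).slice(0, pageSize);
    return { rows, rowCount: rows.length };
  }

  const rows = tables.get(parts[0]);
  if (!rows) {
    return null;
  }

  // /[type]: a reduced view of every row (field choice is an assumption).
  if (parts.length === 1) {
    const summary = rows.map(({ id, title }) => ({ id, title }));
    return { rows: summary, rowCount: summary.length };
  }

  // /[type]/:id: the full entity for a single row.
  if (parts.length === 2) {
    const row = rows.find((candidate) => String(candidate.id) === parts[1]);
    return row ? { rows: [row], rowCount: 1 } : null;
  }

  return null;
}

console.log(resolve('/page').rowCount); // 2
console.log(resolve('/page/2').rows[0].title); // Content page - WET theme
```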
