Skip to content

tmpfs/wget-parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of Contents

Spider parser

Build Status npm version Coverage Status.

Parses the spider output from wget into an object structure of links.

This object could then be processed further to create a tree structure of the hierarchy of a website such that sitemap generation could be implemented.

Tested using wget v1.15 on linux.

Usage

var parser = require('wget-parser')
  , buf = new Buffer(0);      // buffer should contain the spider output
console.dir(parser(buf));
  • parser.Parser: The parser class.
  • parser.Link: The class that represents a link.
  • parser.ParseStream: Parse stream class.

Streams support is available, see the test spec for example usage.

wget-parser

A program that reads from stdin and prints the result of the parse as JSON, exits with error code 1 if any broken links are found.

cat test/fixtures/mock.txt | wget-parser
cat test/fixtures/broken.txt | wget-parser; echo $?;

wget-spider

A program that performs a spider with wget and pipes the output to wget-parser:

wget-spider http://google.com

Output

Example output from the parser:

{
  "links": [
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "google.com",
        "port": null,
        "hostname": "google.com",
        "hash": null,
        "search": null,
        "query": null,
        "pathname": "/",
        "path": "/",
        "href": "http://google.com/"
      },
      "link": "http://google.com/",
      "line": "--2016-02-10 16:11:57--  http://google.com/"
    },
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "www.google.co.id",
        "port": null,
        "hostname": "www.google.co.id",
        "hash": null,
        "search": "?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "query": "gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "pathname": "/",
        "path": "/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "href": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
      },
      "link": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
      "line": "--2016-02-10 16:11:57--  http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
    }
  ],
  "broken": []
}

Developer

Test

To run the test suite:

npm test

Cover

To generate code coverage run:

npm run cover

Lint

Run the source tree through jshint and jscs:

npm run lint

Clean

Remove generated files:

npm run clean

Readme

To build the readme file from the partial definitions:

npm run readme

Generated by mdp(1).

About

Parses the wget spider output

Resources

Stars

Watchers

Forks

Packages

No packages published