HTML Scraper

The scraper has three components, http → split → extract, executed in that order. You make an http request to fetch a page, split the page into sections, and finally extract the data from each section using a custom parser.

The best part about this scraper is that you can chain these actions: the data extracted in one step can feed the http request of the next.

Order of execution

http → split → extract → http → split and so on…
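
To make the order of execution concrete, here is a minimal TypeScript sketch of a chained http → split → extract pipeline. It is not HTML-Scraper's implementation or API: `Pipeline`, `Row`, `Fetcher`, `demo`, and the in-memory `pages` table are hypothetical names chosen for illustration, the fetcher is a stand-in that never touches the network, and plain regular expressions take the place of CSS-selector splitting.

```typescript
// Hypothetical record type: a bag of string fields passed from stage to stage.
type Row = Record<string, string>;
// Hypothetical fetcher: resolves a URL to a page body.
type Fetcher = (url: string) => Promise<string>;
type Stage = (rows: Row[], fetchPage: Fetcher) => Promise<Row[]>;

class Pipeline {
  private stages: Stage[] = [];

  // http: read a URL from `key` in each row and store the page under "body".
  http(key: string): this {
    this.stages.push((rows, fetchPage) =>
      Promise.all(
        rows.map(async (row) => ({ ...row, body: await fetchPage(row[key] || "") }))
      )
    );
    return this;
  }

  // split: turn each page into several rows, one per section.
  split(splitter: (body: string) => string[]): this {
    this.stages.push(async (rows) => {
      const out: Row[] = [];
      for (const row of rows) {
        for (const section of splitter(row.body || "")) {
          out.push({ ...row, body: section });
        }
      }
      return out;
    });
    return this;
  }

  // extract: replace each row with the fields parsed from its section.
  extract(parser: (section: string) => Row): this {
    this.stages.push(async (rows) => rows.map((row) => parser(row.body || "")));
    return this;
  }

  // launch: run the stages in the order they were chained, starting from `base`.
  async launch(base: Row, fetchPage: Fetcher): Promise<Row[]> {
    let rows: Row[] = [base];
    for (const stage of this.stages) {
      rows = await stage(rows, fetchPage);
    }
    return rows;
  }
}

// Return the first capture group of `re` in `text`, or "" when nothing matches.
function firstGroup(re: RegExp, text: string): string {
  const m = re.exec(text);
  return (m && m[1]) || "";
}

// In-memory stand-in for the network (assumed data, for illustration only).
const pages: Record<string, string> = {
  "/archive": '<a href="/s/1">Ann</a><a href="/s/2">Bob</a>',
  "/s/1": "<h1>Ann</h1>",
  "/s/2": "<h1>Bob</h1>",
};
const fakeFetch: Fetcher = async (url) => pages[url] || "";

// http → split → extract → http → extract
function demo(): Promise<Row[]> {
  return new Pipeline()
    .http("url")
    .split((body) => body.match(/<a\s[^>]*>.*?<\/a>/g) || [])
    .extract((a) => ({
      href: firstGroup(/href="([^"]*)"/, a),
      text: firstGroup(/>([^<]*)</, a),
    }))
    .http("href")
    .extract((page) => ({ name: firstGroup(/<h1>([^<]*)<\/h1>/, page) }))
    .launch({ url: "/archive" }, fakeFetch);
}

demo().then((rows) => console.log(rows));
```

Each stage takes the rows produced by the previous one, which is why the `href` field returned by the first `extract` can be named as the key for the second `http`.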


Say I want to extract all the information of students who got admitted to the University of Southern California, Los Angeles. I would do it as follows:

  1. Make an http request to this page —
  2. Page consists of multiple anchor tags containing links of each

    # Standard require
    Scraper = require 'HTML-Scraper'
    # Specify the key to read urls from
    Scraper().http 'url'
    # Split the page into one section per anchor inside .archive
    .split '.archive a'
    # Extract the link and label from each anchor
    .extract (doc) ->
        href: "" + doc.attr 'href'
        text: doc.html()
    # Request each extracted href in turn
    .http 'href'
    .extract ($) ->
        http: $('a:nth-child(2)').attr('href')
    # Launch with base params
    .$launch url: ''
    # Returns a promise
    .then (val) -> console.log val


Scrape data using method chaining

