The scraper has three components: http, split and extract, executed in that order. You make an HTTP request to fetch a page, split the page into different sections, and ultimately extract the data from each section using a custom parser.
The best part about this scraper is that you can create a chain of the actions that you need to perform. The order of execution simply follows the chain: http → split → extract → http → split and so on…
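As a minimal sketch of such a chain (this assumes the HTML-Scraper API used in the full example later in this post; the selector, URL and field name here are placeholders):

```coffeescript
Scraper = require 'HTML-Scraper'

Scraper()
  .http 'url'               # http: fetch the page stored under the 'url' key
  .split '.item'            # split: one sub-document per matched element (placeholder selector)
  .extract (doc) ->         # extract: custom parser run on each sub-document
    text: doc.html()
  .$launch url: 'http://example.com'   # placeholder base params
  .then (results) -> console.log results
  .done()
```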
Say I want to extract all the information about students who got admitted to the University of Southern California, Los Angeles. I would do it as follows:
- Make an HTTP request to this page: http://edulix.com/universityfinder/university_of_southern_california.
- The page consists of multiple anchor tags containing a link to each student's profile, so the scraper needs to split on those anchors and follow each link. The complete, runnable chain below shows this pattern (the sample targets tusharm.com rather than the Edulix page, but the structure is the same).
```coffeescript
# Standard require
Scraper = require 'HTML-Scraper'

# Specify the key to read urls
Scraper()
  .http 'url'
  .split '.archive a'
  .extract (doc) ->
    href: "http://tusharm.com" + doc.attr 'href'
    text: doc.html()
  .http 'href'
  .extract ($) ->
    http: $('a:nth-child(2)').attr('href')
  # Launch with base params
  .$launch url: 'http://tusharm.com/projects.html'
  # Returns a promise
  .then (val) -> console.log val
  .done()
```
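Adapting that chain to the Edulix steps above might look like the sketch below. The selectors and field names are guesses, since they depend on Edulix's actual markup; only the shape of the chain is carried over from the example above.

```coffeescript
Scraper = require 'HTML-Scraper'

Scraper()
  .http 'url'                  # fetch the university page
  .split '.profile a'          # hypothetical selector for the student links
  .extract (doc) ->
    href: doc.attr 'href'      # link to the student's profile
    name: doc.html()
  .http 'href'                 # follow each profile link
  .extract ($) ->
    # hypothetical field; the real selector depends on the profile page
    status: $('.admit-status').text()
  .$launch url: 'http://edulix.com/universityfinder/university_of_southern_california'
  .then (students) -> console.log students
  .done()
```

As in the original example, `$launch` seeds the chain with the base parameters and returns a promise that resolves with the extracted records.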