Web Scraping

What is it?

  • A software technique for automatically collecting and extracting information from websites.

  • Any content that can be viewed on a webpage can be scraped. Period.

  • Before web-based APIs became prominent, we had web scraping.

Think About It

  • Site owners typically care more about maintaining their public-facing visitor site than about their structured data feeds.

  • Most businesses don’t build a lot of defenses against automated access.

  • With APIs, you often have to register to get a key and then send that key with every request. With simple HTTP requests, you’re basically anonymous besides IP address and cookies, which can be spoofed.

  • Web scraping is universally available. You don’t have to wait for a site to open up an API or contact anyone at the organization. Just spend time browsing the site until you find the data you need and figure out some basic access patterns.
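Finding a basic access pattern often comes down to noticing how a site's URLs change as you browse. The sketch below assumes a listing that paginates through a query parameter; the function name `buildPageUrls`, the parameter name `page`, and the `example.com` URL are hypothetical and would need to match whatever the target site actually uses.

```javascript
// Build the list of page URLs for a listing that paginates via a query parameter.
// Assumption: the site uses "?page=N"; real sites may use a different scheme.
function buildPageUrls(baseUrl, pageCount) {
  var urls = [];
  for (var page = 1; page <= pageCount; page++) {
    var u = new URL(baseUrl); // URL is a global in current Node versions
    u.searchParams.set('page', String(page));
    urls.push(u.toString());
  }
  return urls;
}

console.log(buildPageUrls('https://example.com/products', 3));
```

Each generated URL can then be fetched and parsed in turn.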

Scraping with Node

(Express, Request, Cheerio)

Cheerio - http://cheeriojs.github.io/cheerio/

  • Implementation of core jQuery specifically for the server
  • Helps us traverse the DOM and extract data

Require Dependencies

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

Define a GET Route

app.get('/', function (req, res) {
// request to url and load html
});

Start Server and Export Module

app.listen(8081); // listen on port 8081
exports = module.exports = app;

Request to a URL and load HTML

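// url is the address of the page to scrape
request(url, function (err, response, html) {
  // named "response" so it does not shadow the Express "res" inside a route handler
  if (!err && response.statusCode === 200) {
    // pass the HTML to cheerio to get a queryable DOM
    var $ = cheerio.load(html);
    console.log(html);
  }
});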
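The `fs` module is required above but not used in the snippets shown. One plausible use, and this is an assumption rather than something the README states, is saving the extracted data to disk. A minimal sketch follows, with a hypothetical `saveResults` helper, a made-up `scraped` object, and an output file name chosen for the example.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical scraped result; in practice this would be built from cheerio selections.
var scraped = { title: 'Example Domain', links: ['/about', '/contact'] };

// Write the data as pretty-printed JSON and return the path that was written.
function saveResults(filePath, data) {
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2));
  return filePath;
}

// Assumption: the system temp directory is an acceptable place for the output.
var outFile = saveResults(path.join(os.tmpdir(), 'scrape-output.json'), scraped);
console.log('Saved scrape results to ' + outFile);
```

The synchronous write keeps the sketch short; inside a running Express handler, the asynchronous `fs.writeFile` is usually the better choice so the server is not blocked while the file is written.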