
SideCrawl

SideCrawl is a simple web spider, extensible via modules, written with Goliath (EventMachine/Ruby). It gives you jQuery-like selectors on the server (via Nokogiri) so you can parse a large number of pages asynchronously.

Prerequisites

You need to have RVM installed.

Setup Instructions

$ rvm install 2.0
$ bundle install

Getting Started

Create module

To define the rules for extracting page elements, you need to create a module. SideCrawl uses sitemaps for crawling, but you can easily override this. See the example below:

# encoding: utf-8

module Amazon

  module WebsiteSetting
    def init
      @name = "Amazon"
      @description = "Amazon.com"
      @website_url = 'http://www.amazon.com'
      @sources = %w{
        http://www.amazon.com/sitemap_vendor_videos_us.xml
      }
    end
  end

  module PageSetting
    attr_accessor :name, :description, :pictures, :price

    def parse
      @name = @html_doc.at_css('#aiv-content-title').text.strip rescue nil
      @description = @html_doc.at_css('.dv-simple-synopsis').text.strip rescue nil
      @pictures = @html_doc.at_css('.dp-img-bracket img')[:src] rescue nil
      @price = @html_doc.at_css('.dv-button-inner').text.strip.scan(/[0-9]+/).join('.').to_f rescue nil
    end
  end

end
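
To make the price rule in `parse` more concrete, here is a small standalone sketch of the same `scan`/`join`/`to_f` chain applied to plain strings. The helper name `extract_price` and the sample button labels are illustrative assumptions, not part of SideCrawl. Unlike the original expression, the helper returns `nil` when no digits are found instead of `0.0`.

```ruby
# Hypothetical helper mirroring the price expression used in PageSetting#parse.
# It collects every run of digits, joins the runs with a dot, and converts to Float.
def extract_price(text)
  digits = text.strip.scan(/[0-9]+/)   # "Buy HD $9.99" => ["9", "99"]
  return nil if digits.empty?          # assumption: no digits means "no price"
  digits.join('.').to_f
end

puts extract_price("Buy HD $9.99").inspect
puts extract_price("Rent").inspect
```

One caveat of this approach: a label with a thousands separator such as `$1,299.00` yields the digit runs `1`, `299` and `00`, which parse as `1.299`, so a stricter pattern may be needed for larger amounts.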

Output

You can change the output format of a page simply by editing the view (written in RABL).

object @page

attributes :name, :description, :pictures, :price
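
As a rough illustration of what this view selects, the sketch below builds the same four attributes into JSON using only Ruby's standard library. The `Page` struct, its extra `internal_note` field and the sample values are made up for the example. The exact JSON that RABL emits (for instance, whether it is wrapped in a root key) depends on the RABL configuration.

```ruby
require 'json'

# Hypothetical stand-in for the page object exposed to the view as @page.
Page = Struct.new(:name, :description, :pictures, :price, :internal_note)

page = Page.new("Some title", "A short synopsis",
                "http://example.com/cover.jpg", 9.99, "not exported")

# Only the attributes listed in the view end up in the output.
exported = [:name, :description, :pictures, :price]
payload = exported.each_with_object({}) do |attr, hash|
  hash[attr] = page.public_send(attr)
end

puts JSON.pretty_generate(payload)
```

Adding or removing a name in the `attributes` line of the view changes which fields appear, without touching the module that parses the page.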

Environment variables

You can specify environment variables in the file .env

Variable            Description
PORT                Listening port
SERVER_URL          Server URL
RECEIVER_URL        Receiver server URL
TIMEOUT             Timeout
CONCURRENCY_SOURCE  Source concurrency
CONCURRENCY_PAGE    Page concurrency
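
For readers unfamiliar with this pattern, here is a minimal sketch of how the variables listed above could be read from the environment in Ruby. The `load_settings` helper and every default value in it are assumptions chosen for illustration; they are not SideCrawl's actual defaults, and the unit of `TIMEOUT` is not stated in this README.

```ruby
# Sketch: read the settings with fallbacks, converting numeric ones to Integer.
# All default values below are illustrative assumptions.
def load_settings(env = ENV)
  {
    port:               Integer(env.fetch('PORT', '9000')),
    server_url:         env.fetch('SERVER_URL', 'http://localhost:9000'),
    receiver_url:       env.fetch('RECEIVER_URL', 'http://localhost:9001'),
    timeout:            Integer(env.fetch('TIMEOUT', '30')),
    concurrency_source: Integer(env.fetch('CONCURRENCY_SOURCE', '1')),
    concurrency_page:   Integer(env.fetch('CONCURRENCY_PAGE', '10'))
  }
end

# A plain Hash can stand in for ENV, which makes the helper easy to try out.
settings = load_settings('PORT' => '8080', 'CONCURRENCY_PAGE' => '20')
puts settings.inspect
```

Using `Integer(...)` rather than `to_i` makes a malformed value such as `PORT=abc` fail loudly with an `ArgumentError` instead of silently becoming `0`.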

Run SideCrawl

SideCrawl uses foreman. You can specify the number of processes of each type to run (e.g. web=8). See the foreman documentation for details.

$ foreman start web=4

SideCrawl Guide

SideCrawl exposes an API for viewing the results.

Crawling a website

You can run a crawl through a rake task. See the example below:

$ rake crawl['amazon']
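
The bracket syntax passes `amazon` to the task as an argument. The Rakefile itself is not shown in this README, so the following is only a sketch of how a Rake task can receive such an argument and resolve it to a module like `Amazon`. The task body, the argument name `:site`, the `resolve_module` helper and the stub module are all assumptions, not SideCrawl's actual code.

```ruby
require 'rake'
include Rake::DSL

# Stub standing in for a crawl module such as the Amazon example.
module Amazon; end

# Hypothetical helper: turn the task argument into a module constant.
def resolve_module(name)
  Object.const_get(name.to_s.capitalize)
end

# A task declared with a named argument; the bracket value fills args[:site].
task :crawl, [:site] do |_task, args|
  mod = resolve_module(args[:site])
  puts "Would crawl using #{mod}"
end

# Invoke programmatically, comparable to running the task from the shell.
Rake::Task['crawl'].invoke('amazon')
```

Note that some shells, zsh in particular, treat square brackets as glob patterns, so the whole task expression may need quoting there, e.g. `rake 'crawl[amazon]'`.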

Performance: MRI, JRuby, Rubinius

SideCrawl isn't tied to a single Ruby runtime - it is able to run on MRI Ruby, JRuby and Rubinius today. Depending on which platform you are working with, you will see different performance characteristics.
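
When comparing runs across runtimes, it can help to record which one produced a given result. The sketch below uses Ruby's built-in `RUBY_ENGINE` constant for that; the `runtime_label` helper is a hypothetical name introduced for this example.

```ruby
# RUBY_ENGINE is "ruby" on MRI, "jruby" on JRuby and "rbx" on Rubinius.
def runtime_label(engine = RUBY_ENGINE)
  case engine
  when 'ruby'  then 'MRI'
  when 'jruby' then 'JRuby'
  when 'rbx'   then 'Rubinius'
  else engine   # unknown engines are reported as-is
  end
end

puts "Running on #{runtime_label} (Ruby #{RUBY_VERSION})"
```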