Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
The Amazing Amazon Scraping Spider My newest adventure in parsing stuff. Nokogiri to crawl across amazon! Spiders are fun to write, but Amazon is a pain to harvest. Amazon is not keen on id's for the elements of the page, so there's nothing to shout out on the page that, Hey, this is a review! I have to brute force my way across the page and look for 'review-like' content. These smarts are encapsulated in the Amazon::Review module. Take a look in there to see what structure I'm looking for. Take away: very brittle search. Hopefully they don't change structure very often. Features * Recreate social graph of amazon.com reviews * Trivially scalable through Redis * Smart extractor (may or may not need wiggle room, in progress) * Multithreaded with clean shutdown operation * Outputs to any database Sequel supports (not yet!) Tools used Mechanize Nokogiri Sequel Redis Usage To kick off the crawlers you must add a starter page. Depending on where you want your data to come from first (although no guarantees), choose a page in the general area. But to start at square one: redis-cli -h your_host -p your_port -a your_pass sadd href:unvisited http://amazon.com/ Now edit config.yaml to spawn as many workers as you'd like and kick them off ./spider.rb config.yaml Environment I have been developing this on rubinius with the hopes of moving to hydra. So use rvm to get that going: http://rubini.us/2011/02/22/rubinius-multiple-branches-with-rvm/ And don't forget to 'bundle install' to fetch all your required gems! Trivially Scalable? Yes! The spider uses a stack for search but the trick is now we're using a Redis list. Visited URLs are hashed and placed in a used set. If you get the size of this set you have a good idea of how many pages have been visited. As long as the spider can visit Redis it can contribute to the search process! No security in place, but you can put redis behind a firewall and use ssh reverse proxying si necessaire. Tips and Tricks When writing extraction code: * Do not use bang methods (String#strip!, String#gsub!, etc) because they will return nil if no changes are made. I guess it's good to know if there were inplace modifications or not, but in my rolling integration system this will propagate a nil and give you weird behavior. Bugs * I don't think mechanize was built with crawling in mind -- I think it keeps an infinite history of pages visited. ** I have already updated mechanize for links, now to do this. * Because most URLs are root-relative, a new crawler fetching entry from stack will not know what host to connect to. ** Probably have the first URL parsed by ruby's uri lib. If the uri has no host, just set it amazon.com.