I've been scraping some sites recently for a side project, and it occurred to me that I could save the target site bandwidth (and probably speed up the scrape) by checking the Google cache first, and only scraping the site itself as a last resort.
Thus was fromthecache.com born. When you enter a URL, it returns the cached version from Google if one is available; if not, it scrapes the original site. The resulting page is stored in memcached for 30 minutes, so popular requests return instantly.
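The lookup order described above can be sketched roughly as follows. This is not the application's actual code: `TtlStore` is a hypothetical in-memory stand-in for memcached, and `from_google` / `from_origin` are hypothetical callables that return the page HTML or nil. Only the order of the lookups and the 30-minute expiry come from the description.

```ruby
# Hypothetical in-memory stand-in for memcached, with per-entry expiry.
class TtlStore
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @entries = {}
  end

  # Returns the stored value, or nil if it is missing or has expired.
  def read(key)
    value, expires_at = @entries[key]
    return nil if value.nil?
    if @clock.call >= expires_at
      @entries.delete(key)
      return nil
    end
    value
  end

  # Stores value under key for ttl seconds.
  def write(key, value, ttl)
    @entries[key] = [value, @clock.call + ttl]
    value
  end
end

CACHE_TTL = 30 * 60 # 30 minutes, in seconds

# Lookup order: local store, then Google's cache, then the original site.
# from_google and from_origin are assumed callables returning HTML or nil.
def fetch_page(url, store:, from_google:, from_origin:)
  hit = store.read(url)
  return hit if hit

  body = from_google.call(url) || from_origin.call(url)
  store.write(url, body, CACHE_TTL) if body
  body
end
```

With this shape, the original site is only contacted when both the local store and the Google lookup come back empty, which is the bandwidth saving the project is after.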
Another use is an easy way to post mirror links when a site goes down - simply put fromthecache.com/ in front of the URL and you have an instant cache link.
You can try out a live demo of the project at fromthecache.com.
I'm not sure about the legalities of removing the Google cache header text (I do link to the Google cache result though), and I have a suspicion the site may end up blocked by Google anyway for looking like a bot.
Feel free to call the service programmatically if you wish, and basically do whatever you want with it. If anyone has complaints or comments, please drop me a line.
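As a rough illustration of calling the service from a script, the sketch below builds a link by putting fromthecache.com/ in front of the target URL and fetches it with Ruby's standard Net::HTTP. The helper names `cache_link` and `fetch_via_service` are made up for this example, and the `http://` scheme and the lack of any escaping of the target URL are assumptions rather than documented behaviour.

```ruby
require "net/http"
require "uri"

SERVICE_HOST = "fromthecache.com"

# Build a mirror link by prefixing the service host to the target URL.
# Assumption: the target URL is appended as-is, without escaping.
def cache_link(target_url)
  "http://#{SERVICE_HOST}/#{target_url}"
end

# Fetch a page through the service.
# Returns the response body, or nil for any non-2xx response.
def fetch_via_service(target_url)
  response = Net::HTTP.get_response(URI(cache_link(target_url)))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

For example, `cache_link("http://example.com/page")` gives a link you could post as a mirror, and `fetch_via_service` retrieves the same page from a script.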
This project is distributed under the MIT License. See the License file for details.
To get started, first download the source via git :
> git clone git://github.com/recurser/from-the-cache.git
> cd from-the-cache
Next, install the requisite gems :
> gem install bundler
> bundle install
Run memcached if available - the app will cache requests for 30 minutes by default. You'll need to run memcached to use the test suite, but it's not strictly necessary just to run the app.
> memcached -d
Finally, run the local development server to try it out :
> rails s
The demo application should now be available at http://localhost:3000/
From The Cache comes pre-configured for Spork and autotest support. I generally work by running spork in one terminal :
> cd from-the-cache
> spork
Using RSpec
Loading Spork.prefork block...
Spork is ready and listening on 8989!
... and running autotest in another :
> cd from-the-cache
> autotest
........................................................................................
Finished in 29.27 seconds
41 examples, 0 failures
Autotest will run the test suite automatically whenever you save changes, and if you're working on OSX, it will provide Growl feedback every time the test suite is run.
To deploy the application to Heroku, first create the Heroku app by running heroku create :
> heroku create
Creating evening-beach-14... done
Created http://evening-beach-14.heroku.com/ | git@heroku.com:evening-beach-14.git
Git remote heroku added
The evening-beach-14 part will vary depending on the name Heroku chooses for your application.
To push your newly created application to Heroku, do a git push heroku master :
> git push heroku master
Counting objects: 1669, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (629/629), done.
Writing objects: 100% (1669/1669), 382.36 KiB, done.
Total 1669 (delta 955), reused 1657 (delta 949)
-----> Heroku receiving push
-----> Rails app detected
-----> Gemfile detected, running Bundler version 1.0.0
Unresolved dependencies detected; Installing...
Fetching source index for http://rubygems.org/
...
[installing a bunch of gems]
...
Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed.
Your bundle was installed to `.bundle/gems`
Compiled slug size is 25.6MB
-----> Launching.... done
http://evening-beach-14.heroku.com deployed to Heroku
To git@heroku.com:evening-beach-14.git
* [new branch] master -> master
You should also install the free SendGrid add-on for email delivery :
> heroku addons:add sendgrid:free
And finally the free memcached add-on :
> heroku addons:add memcache:5mb
Your application should now be available at http://evening-beach-14.heroku.com/ (substitute the domain you received from git push heroku here).
Whenever you commit new changes, you can update Heroku by running git push heroku master again.
CoffeeScript and Compass both require generated files to be saved when they're compiled - this causes a problem on Heroku because access to the filesystem is limited. There are various hacks to get around this by saving to the tmp folder and re-routing requests, but I decided it was probably easiest to just add the generated files to git and deploy them normally.
To achieve this, I added a pre-commit hook to the repository to generate these files whenever changes are committed. To add it, create the file .git/hooks/pre-commit, make it executable, and add the following contents :
#!/bin/sh
compass compile
rake public/javascripts/application.js
jammit
git add public/assets/common*
git add public/javascripts/application.js
The first two commands generate the CSS and JavaScript respectively, the third packages them up using Jammit, and the two git add lines stage the generated files so that they are included in the commit.
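If you would rather script the hook setup than create the file by hand, something along these lines should work. It is only a sketch: `install_pre_commit_hook` is a hypothetical helper that is not part of the repository, and it overwrites any pre-commit hook that already exists. The hook body is the same as the one listed above.

```ruby
require "fileutils"

# Hook contents, identical to the script listed above.
HOOK_BODY = <<~SH
  #!/bin/sh
  compass compile
  rake public/javascripts/application.js
  jammit
  git add public/assets/common*
  git add public/javascripts/application.js
SH

# Write the hook into <repo_root>/.git/hooks/pre-commit and make it executable.
# Hypothetical helper; overwrites any existing pre-commit hook.
def install_pre_commit_hook(repo_root = ".")
  hooks_dir = File.join(repo_root, ".git", "hooks")
  FileUtils.mkdir_p(hooks_dir)
  path = File.join(hooks_dir, "pre-commit")
  File.write(path, HOOK_BODY)
  File.chmod(0o755, path) # git ignores hooks that are not executable
  path
end
```

Calling `install_pre_commit_hook` from the from-the-cache directory writes the hook and returns its path.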
Stylesheets in the public/stylesheets folder are automatically generated by compass, so any changes you make to these files will be lost. Instead, you should edit the sass files in app/stylesheets.
When altering stylesheets during development, you should run compass watch to make sure your changes are automatically compiled to public/stylesheets :
> compass watch
>>> Compass is watching for changes. Press Ctrl-C to Stop.
Similarly, public/javascripts/application.js is automatically generated by the CoffeeScript compiler. Rather than editing it directly, edit application/scripts/application.coffee instead.
Unlike Compass, there is no need to run a watch script for this file during development - it will automatically be compiled for you.
JavaScript and CSS are automatically compressed and packaged for production with Jammit. During development, the non-compressed versions will be served to speed things up. This packaging should be fairly transparent to you if you set up the git pre-commit hook described above - if you choose not to do this you will need to run the jammit command manually before committing changes.
If you come across any problems, please create a ticket and we'll try to get it fixed as soon as possible.
Once you've made your commits:
- Fork from-the-cache
- Create a topic branch - git checkout -b my_branch
- Push to your branch - git push origin my_branch
- Create a Pull Request from your branch
- That's it!
Dave Perrett :: mail@recursive-design.com :: @recurser
Copyright (c) 2010 Dave Perrett. See License for details.