This repository has been archived by the owner on Feb 24, 2021. It is now read-only.
A simple rails app that returns the google-cache of a URL if available, or the original site otherwise.

recurser/from-the-cache

About

I've been scraping some sites recently for a side project, and it occurred to me that I could save the target site some bandwidth (and probably speed up the scraping) by checking the Google cache first and scraping the site itself only as a last resort.

Thus was fromthecache.com born. When you enter a URL, it returns the cached version from Google if one is available; if not, it scrapes the original site. The resulting page is stored in memcached for 30 minutes, so repeated requests for popular pages return almost instantly.
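The lookup order described above can be sketched in plain Ruby. This is a minimal illustration, not the app's actual code: `TtlCache` is a hypothetical in-memory stand-in for memcached, `page_for` and `google_cache_url` are made-up names, and the Google cache URL format is an assumption.

```ruby
require "uri"

# Hypothetical in-memory stand-in for memcached: entries expire after a TTL.
class TtlCache
  def initialize(ttl_seconds, clock: -> { Time.now.to_f })
    @ttl = ttl_seconds
    @clock = clock
    @store = {}
  end

  # Return the unexpired value for key, or compute, store, and return it.
  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && @clock.call < entry[:expires_at]

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + @ttl }
    value
  end
end

# Assumed Google cache URL format; not taken from the project source.
def google_cache_url(url)
  "http://webcache.googleusercontent.com/search?q=cache:" +
    URI.encode_www_form_component(url)
end

# Try the Google cache first, then fall back to the original site.
# `fetcher` is any callable that takes a URL and returns the page body,
# or nil when the page is unavailable.
def page_for(url, cache:, fetcher:)
  cache.fetch(url) do
    fetcher.call(google_cache_url(url)) || fetcher.call(url)
  end
end
```

With `TtlCache.new(30 * 60)` the second request for the same URL inside the 30-minute window is answered from the cache without calling `fetcher` again. Injecting `fetcher` keeps the sketch independent of any particular HTTP client.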

It is also an easy way to post mirror links when a site goes down: put fromthecache.com/ in front of the URL and you have an instant cache link.

Demo Site

You can try out a live demo of the project at fromthecache.com.

I'm not sure about the legalities of removing the Google cache header text (although I do link to the Google cache result), and I suspect the site may end up blocked by Google anyway for looking like a bot.

Feel free to call the service programmatically if you wish, and basically do whatever you want with it. If anyone has complaints or comments, please drop me a line.
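As a rough sketch of a programmatic call, the snippet below builds the service URL by prefixing the target URL, as described earlier, and fetches it with Ruby's standard `Net::HTTP`. The `http` scheme, the method names, and the assumption that the demo site is still reachable are all illustrative rather than taken from the project.

```ruby
require "net/http"
require "uri"

# Build the service URL by putting the host in front of the target URL.
# The http scheme and the default host are assumptions for illustration.
def from_the_cache_uri(target_url, host: "fromthecache.com")
  URI.parse("http://#{host}/#{target_url}")
end

# Fetch a page through the service.
# Returns the response body, or nil when the reply is not a 2xx success.
def fetch_from_the_cache(target_url, host: "fromthecache.com")
  response = Net::HTTP.get_response(from_the_cache_uri(target_url, host: host))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```

Passing `host: "localhost:3000"` would point the same helper at a local development server instead of the demo site.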

License

This project is distributed under the MIT License. See the License file for details.

Installation

To get started, first download the source via git :

> git clone https://github.com/recurser/from-the-cache.git
> cd from-the-cache

Next, install the requisite gems :

> gem install bundler
> bundle install

Run memcached if available - the app will cache requests for 30 minutes by default. You'll need to run memcached to use the test suite, but it's not strictly necessary just to run the app.
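Before running the test suite it can help to confirm that memcached is actually listening. The check below is a small sketch using only the Ruby standard library; the method name is hypothetical, and it assumes memcached is on its default port, 11211. It only verifies that something accepts TCP connections on that port, not that the listener is memcached.

```ruby
require "socket"

# Return true when something accepts TCP connections on host:port.
# 11211 is memcached's default port; the method name is hypothetical.
def memcached_running?(host: "127.0.0.1", port: 11211, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, unreachable host, or timeout: treat as not running.
  false
end
```

For example, `puts "start memcached first" unless memcached_running?` could go at the top of a local setup script.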

> memcached -d

Finally, run the local development server to try it out :

> rails s

The demo application should now be available at http://localhost:3000/

Testing

From The Cache comes pre-configured for Spork and autotest support. I generally work by running spork in one terminal :

> cd from-the-cache
> spork
Using RSpec
Loading Spork.prefork block...
Spork is ready and listening on 8989!

... and running autotest in another :

> cd from-the-cache
> autotest
........................................................................................

Finished in 29.27 seconds
41 examples, 0 failures

Autotest will run the test suite automatically whenever you save changes, and if you're working on OS X, it will provide Growl feedback every time the test suite is run.

Deploying to Heroku

To deploy the application to Heroku simply run heroku create :

> heroku create
Creating evening-beach-14... done
Created http://evening-beach-14.heroku.com/ | git@heroku.com:evening-beach-14.git
Git remote heroku added

The evening-beach-14 part will vary depending on the name Heroku chooses for your application.

To push your newly created application to Heroku, do a git push heroku master :

> git push heroku master
Counting objects: 1669, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (629/629), done.
Writing objects: 100% (1669/1669), 382.36 KiB, done.
Total 1669 (delta 955), reused 1657 (delta 949)

-----> Heroku receiving push
-----> Rails app detected
-----> Gemfile detected, running Bundler version 1.0.0
       Unresolved dependencies detected; Installing...
       Fetching source index for http://rubygems.org/
       
       ...
       [installing a bunch of gems]
       ...
       
       Your bundle is complete! Use `bundle show [gemname]` to see where a bundled gem is installed.
       
       Your bundle was installed to `.bundle/gems`
       Compiled slug size is 25.6MB
-----> Launching.... done
       http://evening-beach-14.heroku.com deployed to Heroku

To git@heroku.com:evening-beach-14.git
 * [new branch]      master -> master

You should also install the free SendGrid add-on for email delivery :

> heroku addons:add sendgrid:free

And finally the free memcached add-on :

> heroku addons:add memcache:5mb

Your application should now be available at http://evening-beach-14.heroku.com/ (substitute the domain you received from heroku create here).

Whenever you commit changes to git, you can update Heroku by running git push heroku master again.

Git hooks

CoffeeScript and Compass both require generated files to be saved when they're compiled - this causes a problem on Heroku because access to the filesystem is limited. There are various hacks to get around this by saving to the tmp folder and re-routing requests, but I decided it was probably easiest to just add the generated files to git and deploy them normally.

To achieve this, I added a pre-commit hook to the repository to generate these files whenever changes are committed. To add it, create the file .git/hooks/pre-commit , make it executable, and add the following contents :

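#!/bin/sh
# Abort the commit if any step fails, so stale assets are never committed.
set -e

# Generate the CSS and JavaScript, then package them with Jammit.
compass compile
rake public/javascripts/application.js
jammit

# Stage the generated files so they are part of this commit.
git add public/assets/common*
git add public/javascripts/application.js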

The first two commands generate the CSS and JavaScript respectively, and the third packages them up using Jammit. The two git add lines then stage the generated files so they are included in the commit.
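Creating the hook file and making it executable can also be scripted. The following is a minimal sketch, assuming it is run with the repository root as `repo_dir`; the method name and constant are hypothetical and not part of the project. The hook body repeats the commands listed above.

```ruby
require "fileutils"

# Hook body: the same commands as the hook shown above.
HOOK_SCRIPT = <<~SH
  #!/bin/sh

  compass compile
  rake public/javascripts/application.js
  jammit

  git add public/assets/common*
  git add public/javascripts/application.js
SH

# Write the hook to repo_dir/.git/hooks/pre-commit and mark it executable.
# Returns the path of the installed hook. Overwrites any existing hook.
def install_pre_commit_hook(repo_dir)
  hooks_dir = File.join(repo_dir, ".git", "hooks")
  FileUtils.mkdir_p(hooks_dir)
  path = File.join(hooks_dir, "pre-commit")
  File.write(path, HOOK_SCRIPT)
  File.chmod(0o755, path)
  path
end
```

Calling `install_pre_commit_hook(Dir.pwd)` from the repository root installs the hook. Git ignores hook files that are not executable, which is why the `chmod` step matters.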

Stylesheets

Stylesheets in the public/stylesheets folder are automatically generated by compass, so any changes you make to these files will be lost. Instead, you should edit the sass files in app/stylesheets.

When altering stylesheets during development, you should run compass watch to make sure your changes are automatically compiled to public/stylesheets :

> compass watch
>>> Compass is watching for changes. Press Ctrl-C to Stop.

Javascript

Similarly, public/javascripts/application.js is automatically generated by the CoffeeScript compiler. Instead of editing it directly, edit application/scripts/application.coffee.

Unlike Compass, there is no need to run a watch script for this file during development - it will automatically be compiled for you.

Asset Packaging

JavaScript and CSS are automatically compressed and packaged for production with Jammit. During development, the non-compressed versions will be served to speed things up. This packaging should be fairly transparent to you if you set up the git pre-commit hook described above - if you choose not to do this you will need to run the jammit command manually before committing changes.

Bug Reports

If you come across any problems, please create a ticket and we'll try to get it fixed as soon as possible.

Contributing

Once you've made your commits:

  1. Fork from-the-cache
  2. Create a topic branch - git checkout -b my_branch
  3. Push to your branch - git push origin my_branch
  4. Create a Pull Request from your branch
  5. That's it!

Author

Dave Perrett :: mail@recursive-design.com :: @recurser

Copyright

Copyright (c) 2010 Dave Perrett. See License for details.
