Use your Google sitemap XML file to download a copy of your website from the Google Cache

Ruby script to recover HTML/text content from a site that has suffered a catastrophic database failure, using the copies held in the Google cache.

This script assumes you have a Google Sitemap file, usually at <yourdomain>/sitemap.xml
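To illustrate what the sitemap provides, here is a minimal sketch of pulling the page URLs out of a sitemap file. It is a simplification that scans for <loc> elements with a regular expression; the script itself relies on the crack gem, and the helper name sitemap_urls is hypothetical.

```ruby
# Hypothetical helper (not part of the script): returns the URLs found in
# the <loc> elements of a sitemap, given the XML as a string.
def sitemap_urls(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m) # one capture group per match
     .flatten                           # [["url"], ...] -> ["url", ...]
     .map { |url| url.gsub('&amp;', '&') } # undo the most common XML escape
end

sample = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://yourdomain.com/</loc></url>
    <url><loc>http://yourdomain.com/blog/hello?a=1&amp;b=2</loc></url>
  </urlset>
XML

puts sitemap_urls(sample)
```

For a real file, something like sitemap_urls(File.read('path/to/sitemap.xml')) would do the same. A proper XML parser is the safer choice for anything beyond a quick check.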

Requires the following gems:

* crack
* nokogiri

sudo gem install nokogiri
sudo gem install jnunemaker-crack -s http://gems.github.com
git clone git://github.com/steflewandowski/Google-Cache-Sitemap-Downloader.git
cd Google-Cache-Sitemap-Downloader
ruby google_cache_sitemap_downloader.rb 'http://yourdomain.com' path/to/sitemap.xml 'your_css_selector'

E.g. ruby google_cache_sitemap_downloader.rb 'http://steflewandowski.com' ~/Desktop/sitemap.xml 'body'

The last argument is a CSS selector used to check that the page is valid. The script uses standard CSS notation to test whether the specified element exists; if it does not, the file is not written. Examples: 'body', 'h1#logo', '#main_menu.navigation' and so on.

The Google cache result for each URL listed in your sitemap will then be downloaded into a directory named 'results' as HTML, with slashes replaced by '_-_'.
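The naming scheme can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the helper name cache_filename, the '.html' extension, the trimming of leading and trailing slashes, and the 'index' fallback for the site root are all guesses.

```ruby
# Hypothetical helper: maps a page URL to a file path under 'results',
# replacing slashes in the URL path with '_-_'.
def cache_filename(url, base_url, dir = 'results')
  path = url.sub(base_url, '')                  # drop the domain prefix
  path = path.sub(%r{\A/}, '').sub(%r{/\z}, '') # trim leading/trailing slash
  path = 'index' if path.empty?                 # assumed name for the root page
  File.join(dir, path.gsub('/', '_-_') + '.html')
end

puts cache_filename('http://yourdomain.com/blog/2009/hello', 'http://yourdomain.com')
# prints results/blog_-_2009_-_hello.html
```

Flattening the path this way keeps every page in a single directory while leaving the original URL structure readable in the file name.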

Google will rate-limit you and may complain that you are using an automated script. If that happens, the script will exit and you will have to wait some time before Google allows you to use the cache again.

To try to avoid this, the script waits 2 seconds between requests by default.
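A throttled download loop along these lines might look like the sketch below. It is an assumption-laden illustration, not the script's code: fetch_politely is a hypothetical name, the fetch itself is passed in as a block, a nil result stands in for "Google blocked the request", and the pause is injectable so the loop can be tried without waiting or touching the network.

```ruby
# Hypothetical sketch: fetch each URL via the given block, pausing `delay`
# seconds between requests and stopping at the first blocked (nil) response.
def fetch_politely(urls, delay: 2, pause: ->(seconds) { sleep(seconds) })
  results = {}
  urls.each_with_index do |url, index|
    pause.call(delay) if index > 0 # wait between requests, not before the first
    body = yield(url)
    break if body.nil?             # blocked: stop rather than keep hammering
    results[url] = body
  end
  results
end

# Dry run with a fake fetcher and no real waiting.
pages = fetch_politely(%w[/a /b], pause: ->(_seconds) {}) { |url| "<html>#{url}</html>" }
puts pages.keys.inspect
```

Stopping at the first blocked response mirrors the behaviour described above, where the script exits once Google objects; a longer delay lowers the odds of reaching that point at the cost of a slower run.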