# rmoorman/gwern.net forked from briansniffen/gwern.net

---
description: Archiving the Web, because nothing lasts forever - pulling together all the previous archive tools.
...

> "Decay is inherent in all compound things. Work out your own salvation with diligence."^[Last words of the Buddha.]

Given my interest in [long term content](About#long-content) and extensive linking, [link rot](!Wikipedia) is an issue of deep concern to me. I need backups not just for my files[^backups], but for the web pages I read and use - they're all part of my [exomind](!Wikipedia). It's not much good to have an extensive essay on some topic where half the links are dead and the reader can neither verify my claims nor get context for them.

[^backups]: I use [duplicity](http://duplicity.nongnu.org/) to back up my entire home directory to a cheap 1.5TB hard drive (bought from Newegg using ). I used to semiannually tar up my important folders, add [PAR2](!Wikipedia) redundancy, and burn them to DVD, but that's no longer really feasible; if I ever get a Blu-ray burner, I'll resume WORM backups. (Magnetic media doesn't strike me as reliable over many decades, and it would ease my mind to have optical backups.)

The statistics are dismal. To quote [Wikipedia](!Wikipedia "Link rot#Prevalence"):

> "In a 2003 experiment, [Fetterly et al.](http://www2003.org/cdrom/papers/refereed/p097/P97%20sources/p97-fetterly.html) discovered that about one link out of every 200 disappeared each week from the Internet. [McCown et al. (2005)](http://iwaw.europarchive.org/05/papers/iwaw05-mccown1.pdf) discovered that half of the URLs cited in [D-Lib Magazine](!Wikipedia) articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse ([Spinellis, 2003](http://www.spinellis.gr/pubs/jrnl/2003-CACM-URLcite/html/urlcite.html), [Lawrence et al., 2001](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.9695&rep=rep1&type=pdf)). [Nelson and Allen (2002)](http://www.dlib.org/dlib/january02/nelson/01nelson.html) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year."

[Bruce Schneier](!Wikipedia) remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better [in 1998](http://www.pantos.org/atw/35654.html)), and that his own blog posts link to news articles that go dead in days[^schneier]; the [Internet Archive](!Wikipedia) has estimated the average lifespan of a Web page at [100 days](http://www.wired.com/culture/lifestyle/news/2001/10/47894). A _[Science](!Wikipedia "Science (journal)")_ study looked at articles in prestigious journals; they didn't use many Internet links, but when they did, 2 years later ~13% were dead[^science]. The French company Linterweb studied external links on the [French Wikipedia](!Wikipedia) before setting up [their cache](http://www.wikiwix.com/) of French external links, and found - back in 2008 - that already [5% were dead](http://fr.wikipedia.org/wiki/Utilisateur:Pmartin/Cache). (The English Wikipedia has seen a 2010-2011 spike from a few thousand dead links to [~110,000](!Wikipedia "File:Articles-w-Dead-Links-Jan-2011.png") out of [~17.5m live links](!Wikipedia "Wikipedia talk:WikiProject External links/Webcitebot2#Summary").) The dismal studies [just](http://jnci.oxfordjournals.org/content/96/12/969.full) [go](/docs/2007-dimitrova.pdf) [on](/docs/2008-wren.pdf) [and](http://archderm.ama-assn.org/cgi/reprint/142/9/1147.pdf) [on](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2213465/) [and](http://jama.ama-assn.org/content/292/22/2723.3.full) [on](http://www.fasebj.org/content/19/14/1943.full) ([and on](http://www.srlst.com/ijist/V8N2/ijism-V8N2_files/ijism82-57-74.pdf)). Even in a highly stable, funded, curated environment, link rot happens anyway.
[^science]: ["Going, Going, Gone: Lost Internet References"](http://scimaps.org/exhibit/docs/dellawalle.pdf); abstract:

    > "...The extent of Internet referencing and Internet reference activity in medical or scientific publications was systematically examined in more than 1000 articles published between 2000 and 2003 in the New England Journal of Medicine, The Journal of the American Medical Association, and Science. Internet references accounted for 2.6% of all references (672/25548) and in articles 27 months old, 13% of Internet references were inactive."

[^schneier]: ["When the Internet Is My Hard Drive, Should I Trust Third Parties?"](http://www.wired.com/politics/security/commentary/securitymatters/2008/02/securitymatters_0221), _Wired_:

    > "Bits and pieces of the web disappear all the time. It's called 'link rot', and we're all used to it. A friend saved 65 links in 1999 when he planned a trip to Tuscany; only half of them still work today. In my own blog, essays and news articles and websites that I link to regularly disappear -- sometimes within a few days of my linking to them."

My specific target date is 2070, roughly 60 years from now. As of 10 March 2011, gwern.net has around 6800 external links (with around 2200 to non-Wikipedia websites). Even at the very lowest estimate of 3% annual linkrot, few will survive to 2070. If each link has a 97% chance of surviving each year, then the chance a link will be alive in 2070 is $0.97^{2070-2011} \approx 0.17$ or just 17%, or to put it another way, an 83% chance any given link *will* die. The 95% [confidence interval](!Wikipedia) for the [Poisson distribution](!Wikipedia) says that of the 2200 non-Wikipedia links, ~338-409 will survive to 2070. If we try to predict using a more reasonable estimate of 50% linkrot, then approximately 0 links will survive ($0.50^{2070-2011} = 1.73472348 \times 10^{-18} \approx 0$). It would be a good idea to simply assume that *no* link will survive. With that in mind, one can consider remedies.
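The arithmetic is easy to redo under other assumptions. The following is only a sketch: it assumes every link dies independently at a constant annual rate, and the function names `survival` and `expected_survivors` are invented for illustration.

```bash
# survival RATE YEARS: chance that one link is still alive after YEARS years,
# given an annual survival probability RATE (assumed constant and independent).
survival() {
    awk -v rate="$1" -v years="$2" 'BEGIN { printf "%.3g\n", rate ^ years }'
}

# expected_survivors LINKS RATE YEARS: expected number of LINKS still alive.
expected_survivors() {
    awk -v n="$1" -v rate="$2" -v years="$3" 'BEGIN { printf "%.0f\n", n * rate ^ years }'
}

survival 0.97 59                  # 3% annual linkrot, 2011 to 2070: prints 0.166
expected_survivors 2200 0.97 59   # prints 365
survival 0.50 59                  # 50% annual linkrot: prints 1.73e-18
```

Changing the first argument of `survival` shows how quickly the odds collapse once the annual loss is more than a few percent.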
(If we lie to ourselves and say it won't be a problem in the future, then we guarantee that it *will* be a problem. ["People can stand what is true, for they are already enduring it."](http://wiki.lesswrong.com/wiki/Litany_of_Gendlin))

# Detection

> "With every new spring \
> the blossoms speak not a word \
> yet expound the Law -- \
> knowing what is at its heart \
> by the scattering storm winds."^[Shōtetsu; 101, 'Buddhism related to Blossoms'; _Unforgotten Dreams: Poems by the Zen monk Shōtetsu_; trans. Steven D. Carter, ISBN 0-231-10576-2]

The first and most obvious remedy is to learn about broken links as soon as they happen, which allows one to react quickly and scrape archives or search engine caches (['lazy preservation'](http://www.cs.odu.edu/~fmccown/research/lazy/)). I currently use [linkchecker](http://linkchecker.sourceforge.net/) to spider gwern.net looking for broken links. `linkchecker` is run in a [cron](!Wikipedia) job like so:

~~~{.bash}
@monthly linkchecker www.gwern.net --ignore-url=^mailto --ignore-url=^irc --anchors --timeout=1200 --threads=20 --file-output=html
~~~

Just this command would turn up many false positives. For example, there would be several hundred warnings on Wikipedia links because I link to redirects; and `linkchecker` respects [robots.txt](!Wikipedia)s which forbid it to check liveness, but emits a warning about this. These can be suppressed by editing `~/.linkchecker/linkcheckerrc` to say `ignorewarnings=http-moved-permanent,http-robots-denied` (the available warning classes are listed in `linkchecker -h`).

The quicker you know about a dead link, the sooner you can look for replacements or its new home.
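Between full spidering runs, a handful of important URLs can be spot-checked directly. This is a minimal sketch and not a replacement for `linkchecker`: the function names `classify_status` and `check_url` are invented here, it assumes `curl` is installed, and it treats any final 2xx or 3xx status as alive, which is a simplification.

```bash
# classify_status CODE: print "ok" for a 2xx/3xx HTTP status and "dead" for
# anything else (curl reports 000 when it received no HTTP response at all).
classify_status() {
    case "$1" in
        2??|3??) echo ok ;;
        *)       echo dead ;;
    esac
}

# check_url URL: request only the headers, follow redirects, and print
# "<verdict> <status> <url>" for the final response.
check_url() {
    status=$(curl --silent --head --location --max-time 30 \
                  --output /dev/null --write-out '%{http_code}' "$1") || true
    echo "$(classify_status "$status") $status $1"
}
```

One might feed it a file of URLs with `while read -r url; do check_url "$url"; done < urls.txt | grep '^dead'` (the file name is hypothetical). Some servers reject header-only requests, so a "dead" verdict deserves a second look in a browser before the link is replaced.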
# Prevention

## Local caching

On a roughly monthly basis, I run a shell script named, imaginatively enough, `local-archiver`:

~~~{.bash}
#!/bin/sh
set -e

# Work on a copy of the Firefox history database (the first one found)
cp "$(find ~/.mozilla/ -name "places.sqlite" | head -1)" ~/
# List every URL visited in the last 1.5 months (visit_date is in microseconds)
sqlite3 ~/places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                         WHERE moz_places.id = moz_historyvisits.place_id and visit_date > strftime('%s','now','-1.5 month')*1000000 ORDER by \
                         visit_date;" | filter-urls >> ~/.tmp
rm ~/places.sqlite
# Split the list into 500-URL chunks, one per background wget
split -l500 ~/.tmp ~/.tmp-urls
rm ~/.tmp

cd ~/www/
for file in ~/.tmp-urls*; do
    (wget --continue --page-requisites --timestamping --input-file "$file" && rm "$file" &)
done

# Prune any downloaded file larger than 4MB
find ~/www -size +4M -delete
~~~

The code is not the prettiest, but it's fairly straightforward:

- The script grabs my Firefox browsing history by extracting it from the history SQL database file^[Much easier than it was in the past; [Jamie Zawinski](!Wikipedia) records his travails with the *previous* Mozilla history format in the aptly-named ["when the database worms eat into your brain"](http://www.jwz.org/blog/2004/03/when-the-database-worms-eat-into-your-brain/).], and feeds the URLs into [wget](!Wikipedia).
- The script splits the long list of URLs into a bunch of files and runs that many `wget`s because `wget` apparently has no way of simultaneously downloading from multiple domains.
- The `filter-urls` command is another shell script, which removes URLs I don't want archived. This script is a hack which looks like this:

    ~~~{.bash}
    #!/bin/sh
    set -e
    cat /dev/stdin | sed -e "s/#.*//" | sed -e "s/&sid=.*$//" | sed -e "s/\/$//" | grep -v -e 4chan -e reddit ...
    ~~~

A local copy is not the best resource - what if a link goes dead in a way your tool cannot detect so you don't *know* to put up your copy somewhere? But it solves the problem pretty decisively.

The space consumed by such a backup is not that bad; only 30-50 gigabytes for a year of browsing, and less depending on how hard you prune the downloads.
(More, of course, if you use `linkchecker` to archive entire sites and not just the pages you visit.) Storing this is quite viable in the long term; while page sizes have [increased 7x](http://www.websiteoptimization.com/speed/tweak/average-web-page/) between 2003 and 2011 and pages average around 400kb^[An older [2010 Google article](https://code.google.com/speed/articles/web-metrics.html) put the average at 320kb, but that was an average over the entire Web, including all the old content.], [Kryder's law](!Wikipedia) has also been operating and has increased disk capacity by ~128x - in 2011, \$80 will buy you at least [2 terabytes](http://forre.st/storage#hdd), which works out to 4 cents a gigabyte or \$1.20 for the low estimate of 30 gigabytes of downloads; that is much better than the \$25 annual fee that somewhere like [Pinboard](http://pinboard.in/upgrade/) charges. Of course, you need to back this up yourself.

We're relatively fortunate here - most Internet documents are 'born digital' and easy to migrate to new formats or inspect in the future. We can basically just download them and worry about how to view them only when we need a particular document, and Web browser backwards-compatibility already stretches back to files written in the early 1990s. (Of course, we're probably screwed if we discover the content we wanted was presented only in Adobe Flash or as an inaccessible 'cloud' service.) In contrast, if we were trying to preserve programs or software libraries instead, we would face a much more formidable task in keeping a working ladder of binary-compatible virtual machines or interpreters^[Already one runs classic [LucasArts adventure games](!Wikipedia) and whatnot in emulators like [DOSBox](!Wikipedia); but those emulators will not always be maintained. Who will emulate the emulators? Well, presumably one will in 2050 instead emulate on one's laptop some ancient but compatible OS - Windows 7 or Debian 6.0, perhaps - and inside *that* run DOSBox (to run the game).].
The situation with [digital movie preservation](http://www.davidbordwell.net/blog/2012/02/13/pandoras-digital-box-pix-and-pixels/ "Pandora's digital box: Pix and pixels") hardly bears thinking on.

There are ways to cut down on the size; if you tar it all up and run [7-Zip](!Wikipedia) with maximum compression options, you could probably compact it to 1/5th the size. I found that the uncompressed files could be reduced by around 10% by using [fdupes](!Wikipedia) ([homepage](http://netdial.caribe.net/~adrian2/fdupes.html)), which, like [freedup](!Wikipedia "freedup"), looks for duplicate files and turns the duplicates into space-saving [hard links](!Wikipedia "Hard link") to the original, with a command like `fdupes --recurse --hardlink ~/www/`. (Apparently there are a *lot* of bit-identical JavaScript (eg. [JQuery](!Wikipedia)) and images out there.)

## Remote caching

We can ask a third party to keep a cache for us. The only three general [archive sites](!Wikipedia "Archive site") I know of are:

1. the [Internet Archive](!Wikipedia)
2. [WebCite](!Wikipedia)
3. Linterweb's WikiWix^[Which I suspect is only accidentally 'general' and would shut down access if there were some other way to ensure that Wikipedia external links still got archived.].

There are other options, such as Google^[Google Cache is generally recommended only as a last ditch resort because pages expire quickly from it. Personally, I'm convinced that Google would never just delete colossal amounts of Internet data - this is Google, after all, the epitome of storing unthinkable amounts of data - and that Google Cache merely ceases to make public its copies. And to request a Google spider visit, one has to solve a CAPTCHA - so that's not a scalable solution.] or various commercial/government archives^[Which obviously are not publicly accessible or submittable; I know they exist, but because they hide themselves, I know only from random comments online eg. ["years ago a friend of mine who I'd lost contact with caught up with me and told me he found a cached copy of a website I'd taken down in his employer's equivalent to the Wayback Machine. His employer was a branch of the federal government."](http://news.ycombinator.com/item?id=2880427).], but they are not available.

(An example would be being archived at .)

My first program in this vein of thought was a bot which fired off WebCite and Internet Archive/Alexa requests: [Wikipedia Archiving Bot](haskell/Wikipedia Archive Bot), quickly followed up by an [RSS version](haskell/Wikipedia RSS Archive Bot). (Or you could install the [Alexa Toolbar](!Wikipedia) to get automatic submission to the Internet Archive, if you have ceased to care about privacy.)

The core code was quickly adapted into a [gitit](!Hackage) wiki plugin which hooked into the save-page functionality and tried to archive every link in the newly-modified page, [Interwiki.hs](https://github.com/jgm/gitit/blob/master/plugins/Interwiki.hs).

Finally, I wrote [archiver](!Hackage), a daemon which watches[^watch]/reads a text file. (Source is available via `darcs get http://community.haskell.org/~gwern/archiver/`.)

[^watch]: Version [0.1](http://hackage.haskell.org/package/archiver-0.1) of my `archiver` daemon didn't simply read the file until it was empty and exit, but actually watched it for modifications with [inotify](!Wikipedia). I removed this functionality when I realized that the required WebCite choking (just one URL every ~25 seconds) meant that `archiver` would *never* finish any reasonable workload.

The library half of `archiver` is a simple wrapper around the appropriate HTTP requests; the executable half reads a specified text file and loops as it (slowly) fires off requests and deletes the appropriate URL.
That is, `archiver` is a daemon which will process a specified text file, each line of which is a URL, and will one by one request that the URLs be archived or spidered.

Usage of `archiver` might look like `archiver ~/.urls.txt gwern0@gmail.com`. In the past, `archiver` would sometimes crash for unknown reasons, so I usually wrap it in a `while` loop like so: `while true; do archiver ~/.urls.txt gwern0@gmail.com; done`. If I wanted to put it in a detached [GNU screen](!Wikipedia) session: `screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern0@gmail.com; done'`.

`archiver` has an extra feature where any third argument is treated as an arbitrary `sh` command to run after each URL is archived, to which is appended said URL. You might use this feature if you wanted to load each URL into Firefox, or append them to a log file, or simply download or archive the URL in some other way. For example, in conjunction with the big _en masse_ `local-archiver` runs, I have `archiver` run `wget` on each individual URL: `screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern0@gmail.com "cd ~/www && wget --continue --page-requisites --timestamping -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm --user-agent='Firefox/3.5'"; done'`. Alternately, you might use `curl` or a specialized archive downloader like the Internet Archive's crawler [Heritrix](http://crawler.archive.org/).

### URL sources

#### Browser history

There are a number of ways to populate the source text file. For example, I have a script `firefox-urls`:

~~~{.bash}
#!/bin/sh
set -e

# Work on a copy of the first Firefox history database found
cp --force "$(find ~/.mozilla/firefox/ -name "places.sqlite" | sort | head -1)" ~/
# Print every URL visited in the last day (visit_date is in microseconds)
sqlite3 -batch ~/places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                                WHERE moz_places.id = moz_historyvisits.place_id and visit_date > strftime('%s','now','-1 day')*1000000 ORDER by \
                                visit_date;" | filter-urls
rm ~/places.sqlite
~~~

(`filter-urls` is the same script as in `local-archiver`. If I don't want a domain locally, I'm not going to bother with remote backups either. In fact, because of WebCite's rate-limiting, `archiver` is almost perpetually back-logged, and I *especially* don't want it wasting time on worthless links like [4chan](!Wikipedia).)

This is called every hour by `cron`:

~~~{.bash}
@hourly firefox-urls >> ~/.urls.txt
~~~

This gets all URLs visited within the query's time window and appends them to the file for `archiver` to process. Hence, everything I browse is backed up through `archiver`.

#### Document links

More useful perhaps is a script to extract external links from Markdown files and print them to standard out: [link-extractor.hs](haskell/link-extractor.hs).

So now I can take `find . -name "*.page"`, pass the 100 or so Markdown files in my wiki as arguments, and add the thousand or so external links to the `archiver` queue (eg. `find . -name "*.page" -type f -print0 | xargs -0 runhaskell haskell/link-extractor.hs | filter-urls >> ~/.urls.txt`); they will eventually be archived/backed up.

#### Website spidering

Sometimes a particular website is of long-term interest to one even if one has not visited *every* page on it; one could manually visit them and rely on the previous Firefox script to dump the URLs into `archiver`, but this isn't always practical or time-efficient. `linkchecker` inherently spiders the websites it is turned upon, so it's not a surprise that it can build a [site map](!Wikipedia) or simply spit out all URLs on a domain; unfortunately, while `linkchecker` has the ability to output in a remarkable variety of formats, it cannot simply output a newline-delimited list of URLs, so we need to post-process the output considerably.
The following is the shell one-liner I use when I want to archive an entire site (note that this is a bad command to run on a large or heavily hyper-linked site like the English Wikipedia or [LessWrong](http://lesswrong.com)!); edit the target domain as necessary:

~~~{.bash}
nice linkchecker -odot --complete -v --ignore-url=^mailto --no-warnings http://www.longbets.org | fgrep http | fgrep -v -e "label=" -e "->" -e '" [' -e '" ]' -e "/ " | sed -e "s/href=\"//" -e "s/\",//" -e "s/ //" | filter-urls >> ~/.urls.txt
~~~

# Reacting to broken links

`archiver` combined with a tool like `linkchecker` means that there will rarely be any broken links on gwern.net since one can either find a live link or use the archived version. In theory, one has multiple options now:

0. Search for a copy on the live Web
1. link the Internet Archive copy
2. link the WebCite copy
3. link the WikiWix copy
4. use the `wget` dump

    If it's been turned into a full local file-based version with `--convert-links --page-requisites`, one can easily convert the dump into something like a standalone PDF suitable for public distribution. (A PDF is easier to store and link than the original directory of bits and pieces or other HTML formats like a ZIP archive of said directory.) I use [wkhtmltopdf](http://code.google.com/p/wkhtmltopdf/) which does a good job; an example of a dead webpage with no Internet mirrors is `http://www.aeiveos.com/~bradbury/MatrioshkaBrains/MatrioshkaBrainsPaper.html` which can be found at [/docs/1999-bradbury-matrioshkabrains.pdf]().
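The last option can be scripted. What follows is a sketch under several assumptions: the dump lives in `~/www` in `wget`'s default host/path directory layout, the URL names a file rather than a directory (a URL ending in `/` would be saved as `index.html`), and the helper names `mirror_path` and `dump_to_pdf` are invented for illustration.

```bash
# mirror_path URL: print the path of wget's local copy relative to the mirror
# root, by stripping the scheme ("http://") and any "#fragment" from the URL.
mirror_path() {
    printf '%s\n' "$1" | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|#.*$||'
}

# dump_to_pdf URL OUTPUT.pdf: render the local copy of URL as a single PDF.
# Assumes the mirror root is ~/www and that wkhtmltopdf is installed.
dump_to_pdf() {
    wkhtmltopdf "$HOME/www/$(mirror_path "$1")" "$2"
}
```

For the dead page mentioned above, this might be run as `dump_to_pdf 'http://www.aeiveos.com/~bradbury/MatrioshkaBrains/MatrioshkaBrainsPaper.html' 1999-bradbury-matrioshkabrains.pdf`, provided the page was captured before it disappeared.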