I wanted to read several of the pages in the examples category of the High Scalability blog, enough that I felt a bit sick when I started opening tabs.
Since I recently got a Kindle Paperwhite that I'm really liking (here's another recent EPUB-related project for it), I thought: can't I just convert these pages into a book?
- A list of URLs is available in the file `urls`, which is parsed via…
- The `doit.awk` script (name inspiration), by running `./doit.awk -v download=1 title="High scalability" urls` (you need gawk).
The script, in turn:
- Reads URLs line by line from the `urls` file;
- Creates a neat, orderable filename based on each URL;
- Creates a `pandoc` command to fetch from the URL into the filename;
- Runs a `clean.awk` script that purges anything in the markdown file outside the body of the post;
- Sleeps 15 seconds to avoid being nasty to the destination server;
- Creates an EPUB from the cleaned-up markdowns, sorted in lexicographical order (the whole flow is sketched below).
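To give an idea of the shape, here's a minimal sketch of that flow. The zero-padded filename scheme and the exact pandoc flags are my assumptions, not necessarily what the real `doit.awk` does:

```awk
#!/usr/bin/awk -f
# Sketch of the doit.awk flow (filename scheme and pandoc flags are
# assumptions; the script in the repo is the reference).
{
    url = $0
    n++
    # Orderable filename: zero-padding keeps lexicographic = input order.
    file = sprintf("%03d.md", n)
    if (download) {
        # Fetch the page and convert it to markdown in one pandoc call.
        system("pandoc -f html -t markdown -o " file " " url)
        # Strip everything outside the post body (clean.awk is a filter).
        system("./clean.awk " file " > tmp.md && mv tmp.md " file)
        # Be gentle with the destination server.
        system("sleep 15")
    }
    files = files " " file
}
END {
    # The padded names sort lexicographically in the same order as urls.
    system("pandoc --metadata title=\"" title "\" -o book.epub" files)
}
```

Note that `title` is assigned on the command line before `urls`, so it is already set by the time the `END` block runs.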
I like AWK, and iterating through "arrays" in bash scripts or makefiles is a pain (imagine Captain Kirk screaming `xargs` instead of Khan), whereas iterating through lines in AWK and running a command is natural. For brevity I have skipped checking for errors in the pandoc commands (you can get the exit code from `system` as a return value, though), but seriously, AWK is very convenient for anything quick and dirty where you know bash is going to be a pain.
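If you do want that check, it's a one-liner: `system()` returns the command's exit status. A sketch of the guard the script skips:

```awk
#!/usr/bin/awk -f
# Sketch only: stop at the first failing pandoc call.
# system() returns the command's exit status in AWK.
{
    cmd = "pandoc -f html -t markdown -o page.md " $0
    if (system(cmd) != 0) {
        print "pandoc failed for " $0 > "/dev/stderr"
        exit 1
    }
}
```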
The parsing in `clean.awk` is tied specifically to the formatting in High Scalability: if you want to use your own URLs, comment out all the `system` commands in `doit.awk` (`#` starts a comment in AWK), run the pandoc extraction manually, check what the markdown looks like, then parse accordingly.
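The general shape of such a filter is a few-line state machine. A minimal sketch, with hypothetical markers standing in for the High Scalability-specific ones:

```awk
#!/usr/bin/awk -f
# Sketch of a clean.awk-style range filter: keep only the lines between
# a start and an end marker. Both patterns here are hypothetical; find
# the real ones by eyeballing pandoc's markdown output for your site.
/^# /            { inside = 1 }   # first heading starts the post body
/^Related posts/ { inside = 0 }   # a footer section ends it
inside           { print }
```

Run it as a filter, e.g. `./clean.awk 001.md > 001-clean.md`, and adjust the two patterns until only the post body survives.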
I also have a `clean_login.awk` file for USENIX's ;login: articles.
I have only checked the first 3-4 posts for consistency in the generated markdown/EPUB. Since they were OK, my expectation is that the rest are too, which is as good as it gets until I read them all. Caveat emptor.