I wanted to read several of the pages in the Examples category of the High Scalability blog, enough that I felt a bit sick when I started opening tabs. Since I recently got a Kindle Paperwhite I'm really liking (here's another recent EPUB-related project for it), I thought: can't I just convert these pages into a book?
- A list of URLs is available in the file `urls`, which is parsed via…
- …the `doit.awk` script (name inspiration), by running `./doit.awk urls` (you need gawk).

The script, in turn:

- Reads URLs line by line from the `urls` file;
- Creates a neat, orderable filename based on each URL;
- Creates a `pandoc` command to fetch from the URL into the filename;
- Runs a `clean.awk` script that purges anything in the markdown file outside the body of the post;
- Sleeps 15 seconds to avoid being nasty to the destination server;
- Creates an EPUB from the cleaned-up markdowns, sorted in lexicographical order.
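To give an idea of the shape of such a driver, here is a minimal sketch of the fetch loop in dry-run form: it prints the commands instead of executing them, and the `pandoc` flags and the filename scheme are my assumptions, not necessarily what `doit.awk` actually does.

```shell
# Dry-run sketch of a doit.awk-style driver: reads URLs from stdin,
# builds an orderable filename per URL, and prints the pandoc command
# the real script would pass to system() before sleeping 15 seconds.
printf '%s\n' 'https://highscalability.com/example-post/' |
awk '{
    url = $0
    slug = url
    gsub(/^https?:\/\//, "", slug)    # drop the scheme
    gsub(/[^A-Za-z0-9]+/, "-", slug)  # everything else becomes dashes
    # Zero-padded line number keeps lexicographic order == input order.
    file = sprintf("%03d-%s.md", NR, slug)
    printf "pandoc -f html -t markdown -o %s %s\n", file, url
}'
```

Replacing the final `printf` with `system(cmd)` (after building `cmd` with `sprintf`) and adding `system("sleep 15")` turns the dry run into a real fetcher.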
I like AWK, and iterating through "arrays" in bash scripts or makefiles is a pain (imagine Captain Kirk screaming `xargs` instead of Khan), whereas iterating through lines in AWK and running a command is natural. For brevity I have skipped checking for errors in the `pandoc` commands (you can get the exit code from `system` as a return value, though), but seriously, AWK is very convenient when you have something quick and dirty where you know `bash` is going to be a pain.
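That error check, had I bothered, would be a one-liner: `system()` returns the exit status of the command it runs. A toy example (not from the post's scripts), with `false` standing in for a failing `pandoc` invocation:

```shell
awk 'BEGIN {
    # system() returns the exit status of the spawned command,
    # so a failed pandoc run could be caught like this:
    status = system("false")   # stand-in for a pandoc command
    if (status != 0)
        print "command failed with status " status
}'
```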
The parsing in `clean.awk` is tied specifically to High Scalability's formatting: if you want to use your own URLs, comment out all the `system` commands in `doit.awk` (`#` starts a comment in AWK), run the `pandoc` extraction manually and check what the markdown looks like, then adjust the parsing accordingly.
I have only checked the first 3-4 posts for consistency in the generated markdown/EPUB. Since they were OK, my expectation is that all of them are, which is as good as it gets until I read them all. Caveat emptor.