Initial

commit 3e248db216d302a1b80c81744697afe7d21f5fbc 0 parents
Tom Link authored
112 History.txt
@@ -0,0 +1,112 @@
+= 0.6
+
+* RSS attachments: Source title is preferred to the channel's title.
+* body_html: If there is no body tag, use the document as is.
+* rss: also scan items without descriptions with :rss_find_enclosure
+
+= 0.5
+
+* mailto: and javascript: hrefs are now handled via the exclude option
+* rewrite absolute URLs sans host correctly
+* strip href and image src tags in order to prevent parser errors
+* some scaffolding for mechanize
+* global proxy option (currently only used for mechanize)
+* use -nolist for lynx
+* catch errors in Websitary::App#execute_downdiff
+* :rss_find_enclosure => LAMBDA: Extract the enclosure URL from the item
+ description
+* :rss_format_local_copy => STRING|BLOCK/2: Format the display of the
+ local copy.
+
+
+= 0.4
+
+* Sources may have a :timeout option.
+* exclude: Argument can be a string or a regexp.
+* htmldiff: :ignore option to exclude certain nodes from the diff.
+* Left-mouse clicks make items collapse/expand.
+* iconv: Support for converting encodings (requires the per-url iconv
+  option to be set).
+* exclude mailto urls.
+
+
+= 0.3
+
+* Renamed the global option :downloadhtml to :download_html.
+* The downloader for robots and rss enclosures should now be properly
+ configurable via the global options :download_robots and
+ :download_rss_enclosure (default: :openuri).
+* Respect rel="nofollow" on hyperreferences.
+* :wdays, :mdays didn't work.
+* --exclude command line options, exclude configuration command
+* Check for robots.txt-compliance after testing if the URL is
+ appropriate.
+* htmldiff.rb can now also highlight differences à la websec's webdiff.
+* configuration.rb: Ignore pubDate and certain other non-essential fields (tags
+ etc.) when constructing rss item IDs.
+
+
+= 0.2.1
+
+* Use URI.merge for constructing robots.txt uri.
+* Fixed minor show-stopper.
+
+
+= 0.2.0
+
+* Renamed the project from websitiary to websitary (without the
+ additional "i")
+* The default output filename is now constructed on basis of the profile
+ names joined with a comma.
+* Apply rewrite-rules to URLs in text output.
+* Set user-agent (:body_html)
+* Exit with 1 if differences were found
+* Command line options have slightly changed: -e now is the short form
+ for --execute
+* Commands that can be triggered by the -e command-line switch: downdiff
+ (default), configuration (list currently configured urls), latest
+ (show the current version of all urls), review (show the latest
+ report)
+* Protect against filenames being too long (max size can be configured
+ via: <tt>option :global, :filename_size => N</tt>)
+* Try to migrate local copies from the older flat to the new
+ hierarchical cache layout
+* Disabled -E/--edit, --review command-line options (use -e instead)
+* Try to maintain file atime/mtime when copying/moving files
+* FIX: Problem with loading robots.txt
+* Respect meta tag: robots="nofollow" (noindex is only checked in
+ conjunction with :download => :website*)
+* quicklist profile: register urls via the -eadd command-line switch;
+ see "Usage" for an example
+* Temporarily save diffs, so that we can reuse them if websitary
+  exits ungracefully.
+* Renamed :inner_html to :body_html
+* New shortcuts: :ftp, :ftp_recursive, :img, :rss, :opml (rudimentary)
+* New experimental commands: aggregate, show ... can be used to
+ periodically check for changes (e.g. of rss feeds) but to review these
+ changes only once in a while
+* Experimental --timer command-line option to re-run websitary every X
+ seconds.
+* The :rss differ has an option :rss_enclosure (true or directory name)
+ that will be used for automatically saving new enclosures (e.g. mp3
+ files in podcasts); in theory, one should thus be able to use
+ websitary as pod catcher etc.
+* Cache mtimes in order to reduce disk access.
+* Special profile "__END__": The section in the script file after the
+ __END__ line. This seems useful in some situations when employing a
+ single script.
+* Don't follow javascript links.
+* New date constraint for sources:
+ :daily => true ... Once a day
+ :days_of_month => BEGIN..END ... download URL only once per month
+ within this range of days.
+ :days_of_week => BEGIN..END ... download URL only once per week
+ within this range of days.
+  :months => N (calculated on the basis of the calendar month, not the
+  number of days)
+
+
+= 0.1.0 / 2007-07-16
+
+* Initial release
+
88 Makefile
@@ -0,0 +1,88 @@
+include Makefile.config
+
+all: dbk html pdf tex text man
+
+dvi: ${BASE}.dvi
+dbk: ${BASE}.dbk
+html: ${BASE}.html
+pdf:
+ make DFLAGS="${DFLAGS} --pdf" "${BASE}.pdf"
+php: ${BASE}.php
+tex: ${BASE}.tex
+text: ${BASE}.text
+man: ${BASE}.1
+
+pdfclean: pdf cleantex
+dviclean: dvi cleantex
+
+makefile:
+ ${DEPLATE} -m makefile ${DFLAGS} ${BASE}.txt ${OTHER}
+
+website:
+ make prepare_website
+ ${DEPLATE} ${DFLAGS} ${WEBSITE_DFLAGS} ${FILE} ${OTHER}
+ echo ${WEBSITE_DIR}/${BASE}.html > .last_output
+
+%.html: %.txt
+ make prepare_html
+ ${DEPLATE} ${DFLAGS} ${HTML_DFLAGS} $< ${OTHER}
+ echo ${HTML_DIR}/$@ > .last_output
+
+%.text: %.txt
+ make prepare_text
+ ${DEPLATE} ${DFLAGS} ${TEXT_DFLAGS} $< ${OTHER}
+ echo ${TEXT_DIR}/$@ > .last_output
+
+%.php: %.txt
+ make prepare_php
+ ${DEPLATE} ${DFLAGS} ${PHP_DFLAGS} $< ${OTHER}
+ echo ${PHP_DIR}/$@ > .last_output
+
+%.dbk: %.txt
+ make prepare_dbk
+ ${DEPLATE} ${DFLAGS} ${DBK_DFLAGS} $< ${OTHER}
+ echo ${DBK_DIR}/$@ > .last_output
+
+%.tex: %.txt
+ make prepare_tex
+ ${DEPLATE} ${DFLAGS} ${TEX_DFLAGS} $< ${OTHER}
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.ref: %.txt
+ make prepare_ref
+ ${DEPLATE} ${DFLAGS} ${REF_DFLAGS} -o $@ $< ${OTHER}
+ echo ${REF_DIR}/$@ > .last_output
+
+%.dvi: %.tex
+ make prepare_dvi
+ cd ${TEX_DIR}; \
+ latex ${LATEX_FLAGS} $<; \
+ bibtex ${BIBTEX_FLAGS} $*; \
+ latex ${LATEX_FLAGS} $<; \
+ latex ${LATEX_FLAGS} $<;
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.pdf: %.tex
+ make prepare_pdf
+ cd ${TEX_DIR}; \
+ pdflatex ${PDFLATEX_FLAGS} $<; \
+ bibtex ${BIBTEX_FLAGS} $*; \
+ pdflatex ${PDFLATEX_FLAGS} $<; \
+ pdflatex ${PDFLATEX_FLAGS} $<
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.1: %.ref
+ cd ${REF_DIR}; \
+ xmlto man $<
+ echo ${REF_DIR}/$@ > .last_output
+
+view: show
+show:
+ cygstart `cat .last_output`
+
+cleantex:
+ cd ${TEX_DIR}; \
+ rm -f *.toc *.aux *.log *.cp *.fn *.tp *.vr *.pg *.ky \
+ *.blg *.bbl *.out *.lot *.ind *.4tc *.4ct \
+ *.ilg *.idx *.idv *.lg *.xref || echo Nothing to be done!
+
100 Makefile.config
@@ -0,0 +1,100 @@
+FILE=index.txt
+BASE=$(basename ${FILE})
+OTHER=
+
+DEPLATE=deplate
+SCP=scp
+# SCP=pscp
+
+ # --css deplate \
+ # -t html-tabbar-top.html
+DFLAGS=-m code-gvim -D noSwallow=1
+
+HTML_DIR=html
+HTML_PLUS=-m html-obfuscate-email -m html-deplate-button \
+ --css deplate \
+ -t html-tabbar-right.html -m navbar-png
+HTML_DFLAGS=${HTML_PLUS} -d ${HTML_DIR} -f html
+
+WEBSITE_DIR=website
+WEBSITE_DFLAGS=${HTML_PLUS} -d ${WEBSITE_DIR} -f htmlsite
+
+PHP_DIR=php
+PHP_DFLAGS=${HTML_PLUS} -d ${PHP_DIR} -f phpsite -m html-obfuscate-email
+
+TEX_DIR=tex
+TEX_DFLAGS=-d ${TEX_DIR} -f latex
+
+TEXT_DIR=plain
+TEXT_DFLAGS=-d ${TEXT_DIR} -f plain
+
+DBK_DIR=docbook
+DBK_DFLAGS=-d ${DBK_DIR} -f dbk-article
+
+REF_DIR=${DBK_DIR}
+REF_DFLAGS=-d ${REF_DIR} -f dbk-ref
+
+LATEX_FLAGS=-interaction=nonstopmode
+PDFLATEX_FLAGS=${LATEX_FLAGS}
+BIBTEX_FLAGS=
+
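+# Usage: $(call copy_images,DIR) / $(call copy_css,DIR)
+# Copy any images/stylesheets in the current directory to DIR, if present.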
+copy_images=if ls *.{jpg,jpeg,png,gif} 2> /dev/null; then cp -uv *.{jpg,jpeg,png,gif} $(1); fi
+copy_css=if ls *.css 2> /dev/null; then cp -vu *.css $(1); fi
+
+.PHONY: view show cleantex website makefile pdfclean dviclean dbk html pdf tex text man prepare_website prepare_html prepare_text prepare_php prepare_dbk prepare_tex prepare_ref prepare_dvi prepare_pdf
+
+default: website
+
+manual.pdf:
+ make FILE=websitary.txt pdf
+
+manual: website manual.pdf
+
+upload:
+ ${SCP} website/* tex/websitary.pdf tlink@rubyforge.org:/var/www/gforge-projects/websitiary/
+
+docs:
+ rake docs
+ ${SCP} docs/* tlink@rubyforge.org:/var/www/gforge-projects/websitiary/websitary/
+
+prepare_website:
+ mkdir -p ${WEBSITE_DIR}
+ $(call copy_images,"${WEBSITE_DIR}")
+ $(call copy_css,"${WEBSITE_DIR}")
+
+prepare_html:
+ mkdir -p ${HTML_DIR}
+ $(call copy_images,"${HTML_DIR}")
+ $(call copy_css,"${HTML_DIR}")
+
+prepare_text:
+ mkdir -p ${TEXT_DIR}
+
+prepare_php:
+ mkdir -p ${PHP_DIR}
+ $(call copy_images,"${PHP_DIR}")
+ $(call copy_css,"${PHP_DIR}")
+
+prepare_dbk:
+	mkdir -p ${DBK_DIR}
+	$(call copy_images,"${DBK_DIR}")
+
+prepare_tex:
+ mkdir -p ${TEX_DIR}
+ $(call copy_images,"${TEX_DIR}")
+
+prepare_ref:
+ mkdir -p ${REF_DIR}
+
+prepare_dvi:
+
+prepare_pdf:
+
+ctags:
+ rm tags
+ ctags -R bin lib
+
+files:
+ find bin lib -name "*.rb" > files.lst
+
+# vi: ft=make:tw=72:ts=4
780 README.txt
@@ -0,0 +1,780 @@
+websitary by Thomas Link
+http://rubyforge.org/projects/websitiary/
+
+This script monitors webpages, rss feeds, podcasts etc. and reports
+what's new. For many tasks, it reuses other programs to do the actual
+work. By default, it works on an ASCII basis, i.e. with the output of
+text-based web browsers. With the help of some extra libraries, it also
+works with HTML.
+
+
+== DESCRIPTION:
+websitary (formerly known as websitiary with an extra "i") monitors
+webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+etc.) to do most of the actual work. By default, it works on an ASCII
+basis, i.e. with the output of text-based web browsers like w3m (or lynx,
+links etc.) as the output can easily be post-processed. It can also work
+with HTML and highlight new items. This script was originally planned as
+a ruby-based websec replacement.
+
+By default, this script will use w3m to dump HTML pages and then run
+diff over the current page and the previous backup. Some pages are
+better viewed with lynx or links. Downloaded documents (HTML or ASCII)
+can be post-processed (e.g., filtered through some ruby block that
+extracts elements via hpricot and the like). Please see the
+configuration options below to find out how to change this globally or
+for a single source.
+
+This user manual is also available as
+PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
+
+== FEATURES/PROBLEMS:
+* Handle webpages, rss feeds (optionally save attachments in podcasts
+ etc.)
+* Compare webpages with previous backups
+* Display differences between the current version and the backup
+* Provide hooks to post-process the downloaded documents and the diff
+* Display a one-page report summarizing all news
+* Automatically open the report in your favourite web-browser
+* Experimental: Download webpages on defined intervals and generate
+  incremental diffs.
+
+ISSUES, TODO:
+* With HTML output, changes are presented on one single page, which
+ means that pages with different encodings cause problems.
+* Improved support for robots.txt (test it)
+* The use of :website_below and :website is hardly tested (please
+ report errors).
+* download => :body_html tries to rewrite references (a, img) which may
+  fail on certain kinds of urls (please report errors).
+* When using :body_html for download, it may happen that some
+ JavaScript code is stripped, which breaks some JavaScript-generated
+ links.
+* The --log command line option will create a new instance of the
+  logger and thus reset any previous options related to the logging level.
+
+NOTE: The script was previously called websitiary but was renamed (from
+0.2 on) to websitary (without the superfluous i).
+
+
+=== Caveat
+The script also includes experimental support for monitoring whole
+websites. Basically, this script supports robots.txt directives (see
+requirements) but this is hardly tested and may not work in some cases.
+
+While it is okay for your own websites to ignore robots.txt, it is not
+for others. Please make sure that the webpages you run this program on
+allow such a use. Some webpages disallow the use of any automatic
+downloader or offline reader in their user agreements.
+
+
+== SYNOPSIS:
+
+=== Usage
+Example:
+ # Run "profile"
+ websitary profile
+
+ # Edit "~/.websitary/profile.rb"
+ websitary --edit=profile
+
+ # View the latest report
+ websitary -ereview
+
+ # Refetch all sources regardless of :days and :hours restrictions
+ websitary -signore_age=true
+
+ # Create html and rss reports for my websites
+ websitary -fhtml,rss mysites
+
+ # Add an url to the quicklist profile
+ websitary -eadd http://www.example.com
+
+For example output see:
+* html[http://deplate.sourceforge.net/websitary.html]
+* rss[http://deplate.sourceforge.net/websitary.rss]
+* text[http://deplate.sourceforge.net/websitary.txt]
+
+
+=== Configuration
+Profiles are plain ruby files (with the '.rb' suffix) stored in
+~/.websitary/.
+
+The profile "config" (~/.websitary/config.rb) is always loaded if
+available.
+
+There are two special profile names:
+
+-::
+ Read URLs from STDIN.
+<tt>__END__</tt>::
+ Read the profile contained in the script source after the __END__
+ line.
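+
+Example:
+  # Read URLs from standard input via the "-" profile
+  echo http://www.example.com | websitary -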
+
+
+==== default 'PROFILE1', 'PROFILE2' ...
+Set the default profile(s). The default is: quicklist
+
+Example:
+ default 'my_profile'
+
+
+==== diff 'CMD "%s" "%s"'
+Use this shell command to make the diff. The two %s placeholders will
+be replaced with the old and the new filename.
+
+diff is used by default.
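+
+Example (using unified diff output):
+  diff 'diff -u "%s" "%s"'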
+
+
+==== diffprocess lambda {|text| ...}
+Use this ruby snippet to post-process the diff.
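+
+Example (a hypothetical filter that keeps only the lines added
+according to plain diff output):
+  diffprocess lambda {|text| text.scan(/^> .*/).join("\n")}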
+
+
+==== download 'CMD "%s"'
+Use this shell command to download a page.
+%s will be replaced with the url.
+
+w3m is used by default.
+
+Example:
+ download 'lynx -dump "%s"'
+
+
+==== downloadprocess lambda {|text| ...}
+Use this ruby snippet to post-process what was downloaded. Return the
+new text.
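+
+Example (extract a page's content div via hpricot, as in the
+:downloadprocess source option below):
+  downloadprocess lambda {|text| Hpricot(text).at('div#content').inner_html}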
+
+
+==== edit 'CMD "%s"'
+Use this shell command to edit a profile. %s will be replaced with the filename.
+
+vi is used by default.
+
+Example:
+ edit 'gvim "%s"&'
+
+
+==== option TYPE, OPTION => VALUE
+Set a global option.
+
+TYPE can be one of:
+<tt>:diff</tt>::
+ Generate a diff
+<tt>:diffprocess</tt>::
+ Post-process a diff (if necessary)
+<tt>:format</tt>::
+ Format the diff for output
+<tt>:download</tt>::
+ Download webpages
+<tt>:downloadprocess</tt>::
+ Post-process downloaded webpages
+<tt>:page</tt>::
+ The :format field defines the format of the final report. Here VALUE
+ is a format string that takes 3 variables as arguments: report title,
+ toc, contents.
+<tt>:global</tt>::
+ Set a "global" option.
+
+OPTION is a symbol.
+
+VALUE is either a format string or a block of code (of class Proc).
+
+Example:
+ set :download, :foo => lambda {|url| get_url(url)}
+
+
+==== global OPTION => VALUE
+This is the same as <tt>option :global, OPTION => VALUE</tt>.
+
+Known global options:
+
+<tt>:canonic_filename => BLOCK(FILENAME)</tt>::
+  Rewrite filenames as they are stored in the mtimes register. This may
+  be useful if you want to use the same repository on several computers
+  in different locations etc.
+
+<tt>:encoding => OUTPUT_DOCUMENT_ENCODING</tt>::
+ The default is 'ISO-8859-1'.
+
+<tt>:downloadhtml => SHORTCUT</tt>::
+ The default shortcut for downloading plain HTML.
+
+<tt>:file_url => BLOCK(FILENAME)</tt>::
+  Rewrite a filename as it is used for creating file urls to local
+  copies in the output. This may be useful if you want to use the same
+  repository on several computers in different locations etc.
+
+<tt>:filename_size => N</tt>::
+ The max filename size. If a filename becomes longer, md5 encoding will
+ be used for local copies in the cache.
+
+<tt>:toggle_body => BOOLEAN</tt>::
+  If true, make a news body collapsible on mouse-clicks (sort of).
+
+<tt>:proxy => STRING</tt>, <tt>:proxy => ARRAY</tt>::
+ The proxy. (currently only supported by mechanize)
+
+<tt>:user_agent => STRING</tt>::
+ Set the user agent (only for certain queries).
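+
+Example:
+  global :filename_size => 80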
+
+
+==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
+Set the output format.
+Format can be one of:
+
+* html
+* text, txt (this only works with text based downloaders)
+* rss (proof of concept only;
+  it requires :rss[:url] to be set to the url where the rss feed will
+  be published, using the <tt>option :rss, :url => URL</tt>
+  configuration command; you either have to use a text-based downloader
+  or add <tt>:rss_format => 'html'</tt> to the url options)
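+
+Example:
+  output_format 'html', 'rss'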
+
+
+==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
+(Un)Set an option for the following source commands.
+
+Example:
+ set :download, :foo => lambda {|url| get_url(url)}
+  set :days => 7, :sort => true
+ unset :days, :sort
+
+
+==== source URL(S), [OPTIONS]
+Options
+
+<tt>:cols => FROM..TO</tt>::
+  Use only these columns from the output (used after applying the :lines
+  option)
+
+<tt>:depth => INTEGER</tt>::
+  In conjunction with a :website type of :download option, fetch urls up
+  to this depth.
+
+<tt>:diff => "CMD", :diff => SHORTCUT</tt>::
+ Use this command to make the diff for this page. Possible values for
+ SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
+ :wget, or :body_html), :websec_webdiff (use websec's webdiff tool),
+ :body_html, :website_below, :website and :openuri are synonyms for
+ :webdiff.
+ NOTE: Since version 0.3, :webdiff is mapped to websitary's own
+ htmldiff class (which can also be used as stand-alone script). Before
+ 0.3, websitary used websec's webdiff script, which is now mapped to
+ :websec_webdiff.
+
+<tt>:diffprocess => lambda {|text| ...}</tt>::
+ Use this ruby snippet to post-process this diff
+
+<tt>:download => "CMD", :download => SHORTCUT</tt>::
+ Use this command to download this page. For possible values for
+ SHORTCUT see the section on shortcuts below.
+
+<tt>:downloadprocess => lambda {|text| ...}</tt>::
+ Use this ruby snippet to post-process what was downloaded. This is the
+ place where, e.g., hpricot can be used to extract certain elements
+ from the HTML code.
+ Example:
+ lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+<tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
+ The format string for the diff text. The default (the :diff shortcut)
+ wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
+ :website, and :openuri will simply add a newline character.
+
+<tt>:iconv => ENCODING</tt>::
+ If set, use iconv to convert the page body into the summary's document
+ encoding (see the 'global' section). Websitary currently isn't able to
+ automatically determine and convert encodings.
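+  Example (assuming the page is delivered as UTF-8):
+    source 'http://www.example.com', :iconv => 'utf-8'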
+
+<tt>:timeout => SECONDS</tt>::
+ When using openuri, download the page with a timeout.
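+  Example:
+    source 'http://www.example.com', :download => :openuri, :timeout => 60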
+
+<tt>:hours => HOURS, :days => DAYS</tt>::
+ Don't download the file unless it's older than that
+
+<tt>:days_of_month => DAY..DAY, :mdays => DAY..DAY</tt>::
+  Download only once per month within a certain range of days (e.g.,
+  15..31 ... Check once after the 15th). The argument can also be an
+  array (e.g., [1, 15]) or an integer.
+
+<tt>:days_of_week => DAY..DAY, :wdays => DAY..DAY</tt>::
+  Download only once per week within a certain range of days (e.g., 1..2
+  ... Check once on monday or tuesday; sunday = 0). The argument can
+  also be an array (e.g., [1, 5]) or an integer.
+
+<tt>:daily => true</tt>::
+ Download only once a day.
+
+<tt>:ignore_age => true</tt>::
+ Ignore any :days and :hours settings. This is useful in some cases
+ when set on the command line.
+
+<tt>:lines => FROM..TO</tt>::
+ Use only these lines from the output
+
+<tt>:match => REGEXP</tt>::
+ When recursively walking a website, follow only links that match this
+ regexp.
+
+<tt>:rss_rewrite_enclosed_urls => true</tt>::
+ If true, replace urls in the rss feed item description pointing to the
+ enclosure with a file url pointing to the local copy
+
+<tt>:rss_enclosure => true|"DIRECTORY"</tt>::
+ If true, save rss feed enclosures in
+ "~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
+ destination directory. Only enclosures of new items will be saved --
+ i.e. when downloading a feed for the first time, no enclosures will be
+ saved.
+
+<tt>:rss_find_enclosure => BLOCK</tt>::
+  Certain RSS feeds embed enclosures in the description. Use this option
+  to scan the description (a Hpricot document) for a URL that is then
+  saved as an enclosure if the :rss_enclosure option is set.
+ Example:
+ source 'http://www.example.com/rss',
+ :title => 'Example',
+ :use => :rss, :rss_enclosure => true,
+ :rss_find_enclosure => lambda {|item, doc| (doc / 'img').map {|e| e['src']}[0]}
+
+<tt>:rss_format (default: "plain_text")</tt>::
+  When output format is :rss, create rss item descriptions as plain text.
+
+<tt>:rss_format_local_copy => FORMAT_STRING | BLOCK</tt>::
+  By default a hypertext reference to the local copy of an RSS
+  enclosure is added to the entry. Sometimes you may want to display
+  something inline (e.g. an image). You can then use this option to
+  define a format string (one field = the local copy's file url).
+
+<tt>:show_initial => true</tt>::
+ Include initial copies in the report (may not always work properly).
+ This can also be set as a global option.
+
+<tt>:sleep => SECS</tt>::
+ Wait SECS seconds (float or integer) before downloading the page.
+
+<tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
+ Sort lines in output
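+  Example (a hypothetical case-insensitive sort):
+    source 'http://www.example.com',
+        :sort => lambda {|a,b| a.downcase <=> b.downcase}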
+
+<tt>:strip => true</tt>::
+ Strip empty lines
+
+<tt>:title => "TEXT"</tt>::
+ Display TEXT instead of URL
+
+<tt>:use => SYMBOL</tt>::
+  Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
+  :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
+  :body_html</tt> (because for :diff, :body_html is a synonym for
+  :webdiff).
+
+The order of age constraints is:
+:hours > :daily > :wdays > :mdays > :days > :months.
+I.e. if :wdays is set, :mdays, :days, or :months are ignored.
+
+
+==== view 'CMD "%s"'
+Use this shell command to view the output (usually an HTML file).
+%s will be replaced with the filename.
+
+w3m is used by default.
+
+Example:
+ view 'gnome-open "%s"' # Gnome Desktop
+ view 'kfmclient "%s"' # KDE
+ view 'cygstart "%s"' # Cygwin
+ view 'start "%s"' # Windows
+ view 'firefox "%s"'
+
+
+=== Shortcuts for use with :use, :download and other options
+<tt>:w3m</tt>::
+ Use w3m for downloading the source. Use diff for generating diffs.
+
+<tt>:lynx</tt>::
+ Use lynx for downloading the source. Use diff for generating diffs.
+ Lynx doesn't try to recreate the layout of a page like w3m or links
+  do. As a result, the output IMHO sometimes deviates from the original
+  design but is better suited for being post-processed in some
+  situations.
+
+<tt>:links</tt>::
+ Use links for downloading the source. Use diff for generating diffs.
+
+<tt>:curl</tt>::
+ Use curl for downloading the source. Use webdiff for generating diffs.
+
+<tt>:wget</tt>::
+ Use wget for downloading the source. Use webdiff for generating diffs.
+
+<tt>:openuri</tt>::
+ Use open-uri for downloading the source. Use webdiff for generating
+ diffs. This doesn't handle cookies and the like.
+
+<tt>:mechanize</tt>::
+ Use mechanize (must be installed) for downloading the source. Use
+ webdiff for generating diffs. This calls the URL's :mechanize property
+ (a lambda that takes 3 arguments: URL, agent, page => HTML as string)
+ to post-process the page (or if not available, use the page body's
+ HTML).
+
+<tt>:text</tt>::
+ This requires hpricot to be installed. Use open-uri for downloading
+ and hpricot for converting HTML to plain text. This still requires
+  diff as an external helper.
+
+<tt>:body_html</tt>::
+ This requires hpricot to be installed. Use open-uri for downloading
+  the source, use only the body. Use webdiff for generating diffs. Try
+  to rewrite references (a, img) so that they point to the webpage. By
+  default, this will also strip tags like script, form, object ...
+
+<tt>:website</tt>::
+ Use :body_html to download the source. Follow all links referring to
+ the same host with the same file suffix. Use webdiff for generating
+  diffs.
+
+<tt>:website_below</tt>::
+ Use :body_html to download the source. Follow all links referring to
+ the same host and a file below the top directory with the same file
+  suffix. Use webdiff for generating diffs.
+
+<tt>:website_txt</tt>::
+ Use :website to download the source but convert the output to plain
+ text.
+
+<tt>:website_txt_below</tt>::
+ Use :website_below to download the source but convert the output to
+ plain text.
+
+<tt>:rss</tt>::
+ Download an rss feed, show changed items.
+
+<tt>:opml</tt>::
+ Experimental. Download the rss feeds registered in opml. No support
+ for atom yet.
+
+<tt>:img</tt>::
+ Download an image and display it in the output if it has changed
+  (according to diff). You can use hpricot to extract an image from an
+  HTML source. Example (a condensed sketch; see the demonstration
+  configuration below for a full version):
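+
+    source 'http://www.example.com/daily_image/', :use => :img,
+        :download => lambda {|url|
+            # Fetch the HTML, find the first img tag, and download its target.
+            html = open(url) {|io| io.read}
+            img = Hpricot(html).at('img')
+            img && open(rewrite_href(img['src'], url), 'rb') {|io| io.read}
+        }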
+
+Any shortcuts relying on :body_html will also try to rewrite any
+references so that the links point to the webpage.
+
+
+
+=== Example configuration file for demonstration purposes
+
+ # Daily
+ set :days => 1
+
+ # Use lynx instead of the default downloader (w3m).
+ source 'http://www.example.com', :days => 7, :download => :lynx
+
+ # Use the HTML body and process via webdiff.
+ source 'http://www.example.com', :use => :body_html,
+ :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+ # Download a podcast
+ source 'http://www.example.com/podcast.xml', :title => 'Podcast',
+ :use => :rss,
+ :rss_enclosure => '/home/me/podcasts/example'
+
+ # Check a rss feed.
+ source 'http://www.example.com/news.xml', :title => 'News', :use => :rss
+
+ # Get rss feed info from an opml file (EXPERIMENTAL).
+ # @cfgdir is most likely '~/.websitary'.
+ source File.join(@cfgdir, 'news.opml'), :use => :opml
+
+
+ # Weekly
+ set :days => 7
+
+ # Consider the page body only from the 10th line downwards.
+ source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
+
+
+ # Bi-weekly
+ set :days => 14
+
+ # Use these urls with the default options.
+ source <<URLS
+ http://www.example.com
+ http://www.example.com/page.html
+ URLS
+
+  # Make HTML diffs and highlight occurrences of a word
+ source 'http://www.example.com',
+ :title => 'Example',
+ :use => :body_html,
+ :diffprocess => highlighter(/word/i)
+
+ # Download the whole website below this path (only pages with
+ # html-suffix), wait 30 secs between downloads.
+ # Download only php and html pages
+ # Follow links 2 levels deep
+ source 'http://www.example.com/foo/bar.html',
+ :title => 'Example -- Bar',
+ :use => :website_below, :sleep => 30,
+ :match => /\.(php|html)\b/, :depth => 2
+
+ # Download images from some kind of daily-image site (check the user
+ # agreement first, if this is allowed). This may require some ruby
+ # hacking in order to extract the right url.
+ source 'http://www.example.com/daily_image/', :title => 'Daily Image',
+ :use => :img,
+ :download => lambda {|url|
+ rv = nil
+ # Read the HTML.
+ html = open(url) {|io| io.read}
+ # This check is probably unnecessary as the failure to read
+ # the HTML document would most likely result in an
+ # exception.
+ if html
+ # Parse the HTML document.
+ doc = Hpricot(html)
+ # The following could actually be simplified using xpath
+ # or css search expressions. This isn't the most elegant
+ # solution but it works with any value of ALT.
+ # This downloads the image <img src="..." alt="Current Image">
+ # Check all img tags in the HTML document.
+ for e in doc.search(%{//img})
+ # Is this the image we're looking for?
+ if e['alt'] == "Current Image"
+ # Make relative urls absolute
+ img = rewrite_href(e['src'], url)
+ # Get the actual image data
+ rv = open(img, 'rb') {|io| io.read}
+ # Exit the for loop
+ break
+ end
+ end
+ rv
+ end
+ }
+
+
+ unset :days
+
+
+
+=== Commands for use with the -e command-line option
+Most of these commands require you to name a profile on the command
+line. You can define default profiles with the "default" configuration
+command.
+
+If no command is given, "downdiff" is executed.
+
+add::
+ Add the URLs given on the command line to the quicklist profile.
+ ATTENTION: The following arguments on the command line are URLs, not
+ profile names.
+
+aggregate::
+ Retrieve information and save changes for later review.
+
+configuration::
+ Show the fully qualified configuration of each source.
+
+downdiff::
+ Download and show differences (DEFAULT)
+
+edit::
+ Edit the profile given on the command line (use vi by default)
+
+latest::
+ Show the latest copies of the sources from the profiles given
+ on the command line.
+
+ls::
+ List number of aggregated diffs.
+
+rebuild::
+ Rebuild the latest report.
+
+review::
+ Review the latest report (just show it with the browser)
+
+show::
+ Show previously aggregated items. A typical use would be to
+ periodically run in the background a command like
+ websitary -eaggregate newsfeeds
+ and then
+ websitary -eshow newsfeeds
+ to review the changes.
+
+unroll::
+ Undo the latest fetch.
+
+
+
+== TIPS:
+=== Ruby
+The profiles are regular ruby sources that are evaluated in the context
+of the configuration object (Websitary::Configuration). Find out more
+about ruby at:
+* http://www.ruby-lang.org/en/documentation/
+* http://www.ruby-doc.org/docs/ProgrammingRuby/ (especially
+ the
+ language[http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html]
+ chapter)
+
+
+=== Cygwin
+Mixing native Windows apps and cygwin apps can cause problems. The
+following settings (e.g. in ~/.websitary/config.rb) let you use a
+native Windows editor and browser:
+
+ # Use the default Windows programs (as if double-clicked)
+ view '/usr/bin/cygstart "%s"'
+
+ # Translate the profile filename and edit it with a native Windows editor
+ edit 'notepad.exe $(cygpath -w -- "%s")'
+
+ # Rewrite cygwin filenames for use with a native Windows browser
+ option :global, :file_url => lambda {|f| f.sub(/\/cygdrive\/.+?\/.websitary\//, '')}
+
+
+=== Windows
+Backslashes usually have to be escaped by backslashes -- or use slashes.
+I.e. instead of 'c:\foo\bar' write either 'c:\\foo\\bar' or
+'c:/foo/bar'.
+
+
+== REQUIREMENTS:
+websitary is a ruby-based application. You thus need a ruby
+interpreter.
+
+Whether you actually need the following libraries and applications
+depends on how you use websitary.
+
+By default this script expects the following applications to be
+present:
+
+* diff
+* vi (or some other editor)
+
+and one of:
+
+* w3m[http://w3m.sourceforge.net/] (default)
+* lynx[http://lynx.isc.org/]
+* links[http://links.twibright.com/]
+
+The use of :websec_webdiff as :diff application requires
+websec[http://baruch.ev-en.org/proj/websec/] (or at
+Savannah[http://savannah.nongnu.org/projects/websec/]) to be installed.
+By default, websitary uses its own htmldiff class/script, which is less
+well tested and may return inferior results in comparison with websec's
+webdiff. In conjunction with :body_html, :openuri, or :curl, this will
+give you colored HTML diffs.
+
+For downloading HTML, you need one of these:
+
+* open-uri (should be part of ruby)
+* hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
+ :body_html, :website, and :website_below)
+* curl[http://curl.haxx.se/]
+* wget[http://www.gnu.org/software/wget/]
+
+The following ruby libraries are needed in conjunction with :body_html
+and :website related shortcuts:
+
+* hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
+ only the body etc.)
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+ for parsing robots.txt
+
+I personally would suggest the following setup:
+
+* w3m[http://w3m.sourceforge.net/]
+* hpricot[http://code.whytheluckystiff.net/hpricot]
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+
+
+== INSTALL:
+=== Use rubygems
+Run
+
+ gem install websitary
+
+This will download the package and install it.
+
+
+=== Use the zip
+The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
+setup.rb that does the work. Run
+
+ ruby setup.rb
+
+
+=== Initial Configuration
+Please check the requirements section above and get the extra libraries
+needed:
+* hpricot
+* robot_rules.rb
+
+These could be installed by:
+
+ # Install hpricot
+ gem install hpricot
+
+ # Install robot_rules.rb
+ wget http://www.rubyquiz.com/quiz64_sols.zip
+ # Check the correct path to site_ruby first!
+ unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
+ rm quiz64_sols.zip
+
+You might then want to create a profile ~/.websitary/config.rb that is
+loaded on every run. In this profile you could set the default output
+viewer and profile editor, as well as a default profile.
+
+Example:
+
+ # Load standard.rb if no profile is given on the command line.
+ default 'standard'
+
+ # Use cygwin's cygstart to view the output with the default HTML
+ # viewer
+ view '/usr/bin/cygstart "%s"'
+
+ # Use Windows gvim from cygwin ruby which is why we convert the path
+ # first
+ edit 'gvim $(cygpath -w -- "%s")'
+
+Where these configuration files reside may differ. If the environment
+variable $HOME is defined, the default is $HOME/.websitary/ unless one
+of the following directories exist, which will then be used instead:
+
+* $USERPROFILE/websitary (on Windows)
+* SYSCONFDIR/websitary (where SYSCONFDIR usually is /etc but you can
+ run ruby to find out more:
+ <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
+
+If neither directory exists and no $HOME variable is defined, the
+current directory will be used.
+
+Now check out the configuration commands in the Synopsis section.
+
+
+== LICENSE:
+websitary Webpage Monitor
+Copyright (C) 2007-2008 Thomas Link
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+USA
+
31 Rakefile
@@ -0,0 +1,31 @@
+# -*- ruby -*-
+
+require 'rubygems'
+require 'hoe'
+load './lib/websitary.rb'
+
+Hoe.new('websitary', Websitary::VERSION) do |p|
+ p.rubyforge_name = 'websitiary'
+ p.author = 'Tom Link'
+ p.email = 'micathom at gmail com'
+ p.summary = 'A unified website news, rss feed, podcast monitor'
+ p.description = p.paragraphs_of('README.txt', 2..5).join("\n\n")
+ p.url = p.paragraphs_of('README.txt', 0).first.split(/\n/)[1..-1]
+ p.changes = p.paragraphs_of('History.txt', 0..1).join("\n\n")
+ p.extra_deps << 'hpricot'
+ # p.need_tgz = false
+ p.need_zip = true
+end
+
+require 'rtagstask'
+RTagsTask.new
+
+task :ctags do
+ puts `ctags --extra=+q --fields=+i+S -R bin lib`
+end
+
+task :files do
+ puts `find bin lib -name "*.rb" > files.lst`
+end
+
+# vim: syntax=Ruby
43 bin/websitary
@@ -0,0 +1,43 @@
+#! /usr/bin/env ruby
+# websitary.rb -- The website news, rss feed, podcast catching monitor
+# @Last Change: 2008-02-12.
+# Author:: Thomas Link (micathom at gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-06-09.
+
+
+require 'websitary'
+
+
+if __FILE__ == $0
+ w = Websitary::App.new(ARGV)
+ t = w.configuration.optval_get(:global, :timer)
+ if t
+ exit_code = 0
+ while exit_code <= 1
+ exit_code = Websitary::App.new(ARGV).process
+ case t
+ when Numeric
+ $logger.info "Sleep: #{t}s"
+ sleep t
+ when Proc
+ t.call
+ else
+ $logger.fatal "Malformed timer: #{t}"
+ exit_code = 5
+ break
+ end
+ end
+ else
+ exit_code = w.process
+ end
+ exit exit_code
+ # sleep 5
+end
+
+
+
+# vi: ft=ruby:tw=72:ts=2:sw=4
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
30 index.txt
@@ -0,0 +1,30 @@
+% #VAR: css=tabbar-top.css|screen, +serif.css
+% #VAR: tabBarPos=top tabEqualWidths! noTabBarButtons!
+
+#VAR: css=tabbar-right.css|screen, article.css|print, +serif.css
+#VAR: autoindex! buttonsColour=blue buttonsHighlight!
+#VAR: encoding=latin-1
+
+#VAR: tabBarHomeName=OVERVIEW:
+#VAR: headings=plain autoFileNames!
+#VAR: urlIcon=remote.png mailtoIcon=mailto.png markerInFrontOfURL!
+#VAR: baseUrl=http://websitiary.rubyforge.org/ baseUrlStripDir=1
+#VAR: levelshift=-1 codeSyntax=ruby codeStyle=tomacs
+#Var id=tabBar <<--
+[auto]
+API: | http://websitiary.rubyforge.org/websitary/
+--
+
+#TITLE: websitary
+#AUTHOR: Thomas Link
+% #DATE: today
+#MAKETITLE
+#LIST plain! max=2: contents
+
+#INC inputFormat=rdoc: README.txt
+
+
+% 2007-09-01; @Last Change: 2007-09-16.
+% vi: ft=viki:tw=72:ts=4
+% Local Variables:
+% End:
555 lib/websitary.rb
@@ -0,0 +1,555 @@
+# websitary.rb
+# @Last Change: 2008-03-11.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+
+require 'cgi'
+require 'digest/md5'
+# require 'ftools'
+require 'fileutils'
+require 'net/ftp'
+require 'optparse'
+require 'pathname'
+require 'rbconfig'
+require 'uri'
+require 'open-uri'
+require 'timeout'
+require 'yaml'
+require 'rss'
+
+['hpricot', 'robot_rules'].each do |f|
+ begin
+ require f
+ rescue Exception => e
+ $stderr.puts <<EOT
+#{e.message}
+Library could not be loaded: #{f}
+Please see the requirements section at: http://websitiary.rubyforge.org
+EOT
+ end
+end
+
+
+module Websitary
+ APPNAME = 'websitary'
+ VERSION = '0.5'
+ REVISION = '2476'
+end
+
+require 'websitary/applog'
+require 'websitary/filemtimes'
+require 'websitary/configuration'
+require 'websitary/htmldiff'
+
+
+# Basic usage:
+# Websitary::App.new(ARGV).process
+class Websitary::App
+ MINUTE_SECS = 60
+ HOUR_SECS = MINUTE_SECS * 60
+ DAY_SECS = HOUR_SECS * 24
+
+
+ # Hash: The output of the diff commands for each url.
+ attr_reader :difftext
+
+ # The configurator
+ attr_reader :configuration
+
+ # Secs until next update.
+ attr_reader :tdiff_min
+
+
+ # args:: Array of command-line (like) arguments.
+ def initialize(args=[])
+ @configuration = Websitary::Configuration.new(self, args)
+ @difftext = {}
+ @tdiff_min = nil
+
+ ensure_dir(@configuration.cfgdir)
+ css = File.join(@configuration.cfgdir, 'websitary.css')
+ unless File.exists?(css)
+ $logger.info "Copying default css file: #{css}"
+ @configuration.write_file(css, 'w') do |io|
+ io.puts @configuration.opt_get(:page, :css)
+ end
+ end
+ end
+
+
+ # Run the command stored in @execute.
+ def process
+ begin
+ m = "execute_#{@configuration.execute}"
+ if respond_to?(m)
+ exit_code = send(m)
+ else
+ $logger.fatal "Unknown command: #{@configuration.execute}"
+ exit_code = 5
+ end
+ ensure
+ @configuration.mtimes.swap_out
+ end
+ return exit_code
+ end
+
+
+ # Show the currently configured URLs
+ def execute_configuration
+ keys = @configuration.options.keys
+ urls = @configuration.todo
+        # urls = @configuration.todo.sort {|a,b| @configuration.url_get(a, :title, a) <=> @configuration.url_get(b, :title, b)}
+ urls.each_with_index do |url, i|
+ data = @configuration.urls[url]
+ text = [
+ "<b>URL</b><br/>#{url}<br/>",
+ "<b>current</b><br/>#{CGI.escapeHTML(@configuration.latestname(url, true))}<br/>",
+ "<b>backup</b><br/>#{CGI.escapeHTML(@configuration.oldname(url, true))}<br/>",
+ *((data.keys | keys).map do |k|
+ v = @configuration.url_get(url, k).inspect
+ "<b>:#{k}</b><br/>#{CGI.escapeHTML(v)}<br/>"
+ end)
+ ]
+ accumulate(url, text.join("<br/>"))
+ end
+ return show
+ end
+
+
+ def cmdline_arg_add(configuration, url)
+ configuration.to_do url
+ end
+
+
+ def execute_add
+ if @configuration.quicklist_profile
+ quicklist = @configuration.profile_filename(@configuration.quicklist_profile, false)
+ $logger.info "Use quicklist file: #{quicklist}"
+ if quicklist
+ @configuration.write_file(quicklist, 'a') do |io|
+ @configuration.todo.each do |url|
+ io.puts %{source #{url.inspect}}
+ end
+ end
+ return 0
+ end
+ end
+ $logger.fatal 'No valid quick-list profile defined'
+ exit 5
+ end
+
+
+ # Restore previous backups
+ def execute_unroll
+ @configuration.todo.each do |url|
+ latest = @configuration.latestname(url, true)
+ backup = @configuration.oldname(url, true)
+ if File.exist?(backup)
+ $logger.warn "Restore: #{url}"
+ $logger.debug "Copy: #{backup} => #{latest}"
+ copy(backup, latest)
+ end
+ end
+ return 0
+ end
+
+
+ # Edit currently chosen profiles
+ def execute_edit
+ @configuration.edit_profile
+ exit 0
+ end
+
+
+ # Show the latest report
+ def execute_review
+ @configuration.view_output
+ 0
+ end
+
+
+ # Show the current version of all urls
+ def execute_latest
+ @configuration.todo.each do |url|
+ latest = @configuration.latestname(url)
+ text = File.read(latest)
+ accumulate(url, text)
+ end
+ return show
+ end
+
+
+ # Rebuild the report from the already downloaded copies.
+ def execute_rebuild
+ execute_downdiff(true, true)
+ end
+
+
+ # Aggregate data for later review (see #execute_show)
+ def execute_aggregate
+ rv = execute_downdiff(false) do |url, difftext, opts|
+ if difftext and !difftext.empty?
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ aggrext = Digest::MD5.hexdigest(Time.now.to_s)
+ aggrfile = [aggrbase, aggrext].join('_')
+ @configuration.write_file(aggrfile) {|io| io.puts difftext}
+ end
+ end
+ clean_diffs
+ rv
+ end
+
+
+ def execute_ls
+ rv = 0
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ name = @configuration.url_get(url, :title, url)
+ $logger.debug "Source: #{name}"
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ aggrfiles = Dir["#{aggrbase}_*"]
+ aggrn = aggrfiles.size
+ if aggrn > 0
+ puts "%3d - %s" % [aggrn, name]
+ rv = 1
+ end
+ end
+ rv
+ end
+
+
+ # Show data collected by #execute_aggregate
+ def execute_show
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ difftext = []
+ aggrfiles = Dir["#{aggrbase}_*"]
+ aggrfiles.each do |file|
+ difftext << File.read(file)
+ end
+ difftext.compact!
+ difftext.delete('')
+ unless difftext.empty?
+ joindiffs = @configuration.url_get(url, :joindiffs, lambda {|t| t.join("\n")})
+ difftext = @configuration.call_cmd(joindiffs, [difftext], :url => url) if joindiffs
+ accumulate(url, difftext, opts)
+ end
+ aggrfiles.each do |file|
+ File.delete(file)
+ end
+ end
+ show
+ end
+
+
+ # Process the sources in @configuration.url as defined by profiles
+ # and command-line options. The differences are stored in @difftext (a Hash).
+ # show_output:: If true, show the output with the defined viewer.
+ def execute_downdiff(show_output=true, rebuild=false, &accumulator)
+ if @configuration.todo.empty?
+ $logger.error 'Nothing to do'
+ return 5
+ end
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
+
+ diffed = @configuration.diffname(url, true)
+ $logger.debug "diffname: #{diffed}"
+
+ if File.exists?(diffed)
+ $logger.warn "Reuse old diff: #{@configuration.url_get(url, :title, url)} => #{diffed}"
+ difftext = File.read(diffed)
+ accumulate(url, difftext, opts)
+ else
+ latest = @configuration.latestname(url, true)
+ $logger.debug "latest: #{latest}"
+ next unless rebuild or !skip_url?(url, latest, opts)
+
+ older = @configuration.oldname(url, true)
+ $logger.debug "older: #{older}"
+
+ begin
+ if rebuild or download(url, opts, latest, older)
+ difftext = diff(url, opts, latest, older)
+ if difftext
+ @configuration.write_file(diffed, 'wb') {|io| io.puts difftext}
+ # $logger.debug "difftext: #{difftext}" #DBG#
+ if accumulator
+ accumulator.call(url, difftext, opts)
+ else
+ accumulate(url, difftext, opts)
+ end
+ end
+ end
+ rescue Exception => e
+ $logger.error e.to_s
+ $logger.info e.backtrace.join("\n")
+ end
+ end
+ end
+ return show_output ? show : @difftext.empty? ? 0 : 1
+ end
+
+
+ def move(from, to)
+ # copy_move(:rename, from, to) # ftools
+ copy_move(:mv, from, to) # FileUtils
+ end
+
+
+ def copy(from, to)
+ # copy_move(:copy, from, to)
+ copy_move(:cp, from, to)
+ end
+
+
+ def copy_move(method, from, to)
+ if File.exists?(from)
+ $logger.debug "Overwrite: #{from} -> #{to}" if File.exists?(to)
+ lst = File.lstat(from)
+ FileUtils.send(method, from, to)
+ File.utime(lst.atime, lst.mtime, to)
+ @configuration.mtimes.set(from, lst.mtime)
+ @configuration.mtimes.set(to, lst.mtime)
+ end
+ end
+
+
+ def format_tdiff(secs)
+ d = (secs / DAY_SECS).to_i
+ if d > 0
+ return "#{d}d"
+ else
+ d = (secs / HOUR_SECS).to_i
+ return "#{d}h"
+ end
+ end
+
+
+ def ensure_dir(dir, fatal_nondir=true)
+ if File.exist?(dir)
+ unless File.directory?(dir)
+ if fatal_nondir
+ $logger.fatal "Not a directory: #{dir}"
+ exit 5
+ else
+ $logger.info "Not a directory: #{dir}"
+ return false
+ end
+ end
+ else
+ parent = Pathname.new(dir).parent.to_s
+ ensure_dir(parent, fatal_nondir) unless File.directory?(parent)
+ Dir.mkdir(dir)
+ end
+ return true
+ end
+
+
+ private
+
+ def download(url, opts, latest, older=nil)
+ if @configuration.done.include?(url)
+ $logger.info "Already downloaded: #{@configuration.url_get(url, :title, url).inspect}"
+ return false
+ end
+
+ $logger.warn "Download: #{@configuration.url_get(url, :title, url).inspect}"
+ @configuration.done << url
+ text = @configuration.call_cmd(@configuration.url_get(url, :download), [url], :url => url)
+ # $logger.debug text #DBG#
+ unless text
+ $logger.warn "no contents: #{@configuration.url_get(url, :title, url)}"
+ return false
+ end
+
+ if opts
+ if (sleepsecs = opts[:sleep])
+ sleep sleepsecs
+ end
+ text = text.split("\n")
+ if (range = opts[:lines])
+ $logger.debug "download: lines=#{range}"
+ text = text[range] || []
+ end
+ if (range = opts[:cols])
+ $logger.debug "download: cols=#{range}"
+ text.map! {|l| l[range]}
+ text.compact!
+ end
+ if (o = opts[:sort])
+ $logger.debug "download: sort=#{o}"
+ case o
+ when true
+ text.sort!
+ when Proc
+ text.sort!(&o)
+ end
+ end
+ if (o = opts[:strip])
+ $logger.debug "download: strip!"
+ text.delete_if {|l| l !~ /\S/}
+ end
+ text = text.join("\n")
+ end
+
+ pprc = @configuration.url_get(url, :downloadprocess)
+ if pprc
+ $logger.debug "download process: #{pprc}"
+ text = @configuration.call_cmd(pprc, [text], :url => url)
+ # $logger.debug text #DBG#
+ end
+
+ if text and !text.empty?
+ if older
+ if File.exist?(latest)
+ move(latest, older)
+ elsif !File.exist?(older)
+ $logger.warn "Initial copy: #{latest.inspect}"
+ end
+ end
+ @configuration.write_file(latest) {|io| io.puts(text)}
+ return true
+ else
+ return false
+ end
+ end
+
+
+ def diff(url, opts, new, old)
+ if File.exists?(old)
+ $logger.debug "diff: #{old} <-> #{new}"
+ difftext = @configuration.call_cmd(@configuration.url_get(url, :diff), [old, new], :url => url)
+ # $logger.debug "diff: #{difftext}" #DBG#
+
+ if difftext =~ /\S/
+ if (pprc = @configuration.url_get(url, :diffprocess))
+ $logger.debug "diff process: #{pprc}"
+ difftext = @configuration.call_cmd(pprc, [difftext], :url => url)
+ end
+ # $logger.debug "difftext: #{difftext}" #DBG#
+ if difftext =~ /\S/
+ $logger.warn "Changed: #{@configuration.url_get(url, :title, url).inspect}"
+ return difftext
+ end
+ end
+
+ $logger.debug "Unchanged: #{@configuration.url_get(url, :title, url).inspect}"
+
+ elsif File.exist?(new) and
+ (@configuration.url_get(url, :show_initial) or @configuration.optval_get(:global, :show_initial))
+
+ return File.read(new)
+
+ end
+ return nil
+ end
+
+
+ def skip_url?(url, latest, opts)
+ if File.exists?(latest) and !opts[:ignore_age]
+ tn = Time.now
+ tl = @configuration.mtimes.mtime(latest)
+ td = tn - tl
+ tdiff = tdiff_with(opts, tn, tl)
+ case tdiff
+ when nil, false
+ $logger.debug "Age requirement fulfilled: #{@configuration.url_get(url, :title, url).inspect}: #{format_tdiff(td)} old"
+ return false
+ when :skip, true
+ $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old"
+ return true
+ when Numeric
+ if td < tdiff
+ tdd = tdiff - td
+ @tdiff_min = tdd if @tdiff_min.nil? or tdd < @tdiff_min
+ $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old (#{format_tdiff(tdiff)})"
+ return true
+ end
+ else
+ $logger.fatal "Internal error: tdiff=#{tdiff.inspect}"
+ exit 5
+ end
+ end
+ end
+
+
+ def tdiff_with(opts, tn, tl)
+ if (hdiff = opts[:hours])
+ tdiff = hdiff * HOUR_SECS
+ $logger.debug "hours: #{hdiff} (#{tdiff}s)"
+ elsif (daily = opts[:daily])
+ tdiff = tl.year == tn.year && tl.yday == tn.yday
+ $logger.debug "daily: #{tl} <=> #{tn} (#{tdiff})"
+ elsif (dweek = opts[:days_of_week] || opts[:wdays])
+ tdiff = tdiff_x_of_y(dweek, tn.wday, tn.yday / 7, tl.yday / 7)
+ $logger.debug "wdays: #{dweek} (#{tdiff})"
+ elsif (dmonth = opts[:days_of_month] || opts[:mdays])
+ tdiff = tdiff_x_of_y(dmonth, tn.day, tn.month, tl.month)
+ $logger.debug "mdays: #{dmonth} (#{tdiff})"
+ elsif (ddiff = opts[:days])
+ tdiff = ddiff * DAY_SECS
+ $logger.debug "days: #{ddiff} (#{tdiff}s)"
+ elsif (dmonth = opts[:months])
+ tnowm = tn.month + 12 * (tn.year - tl.year)
+ tlm = tl.month
+ tdiff = (tnowm - tlm) < dmonth
+ $logger.debug "months: #{dmonth} (#{tdiff})"
+ else
+ tdiff = false
+ end
+ return tdiff
+ end
+
+
+ def tdiff_x_of_y(eligible, now, parent_eligible, parent_now)
+ if parent_eligible == parent_now
+ return true
+ else
+ case eligible
+ when Array, Range
+ return !eligible.include?(now)
+ when Integer
+ return eligible != now
+ else
+                $logger.error "Wrong type for date constraint: #{eligible.inspect}"
+ return :skip
+ end
+ end
+ end
+
+
+ def accumulate(url, difftext, opts=nil)
+ # opts ||= @configuration.urls[url]
+ @difftext[url] = difftext
+ end
+
+
+ def show
+ begin
+ return @configuration.show_output(@difftext)
+ ensure
+ clean_diffs
+ end
+ end
+
+
+ def clean_diffs
+ Dir[File.join(@configuration.cfgdir, 'diff', '*')].each do |f|
+ $logger.debug "Delete saved diff: #{f}"
+ File.delete(f)
+ end
+ end
+
+end
+
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
39 lib/websitary/applog.rb
@@ -0,0 +1,39 @@
+# applog.rb
+# @Last Change: 2007-09-11.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+require 'logger'
+
+
+# A simple wrapper around Logger.
+class Websitary::AppLog
+ def initialize(output=nil)
+ @output = output || $stdout
+ $logger = Logger.new(@output, 'daily')
+ $logger.progname = Websitary::APPNAME
+ $logger.datetime_format = "%H:%M:%S"
+ set_level
+ end
+
+
+ def set_level(level=:default)
+ case level
+ when :debug
+ $logger.level = Logger::DEBUG
+ when :verbose
+ $logger.level = Logger::INFO
+ when :quiet
+ $logger.level = Logger::ERROR
+ else
+ $logger.level = Logger::WARN
+ end
+ $logger.debug "Set logger level: #{level}"
+ end
+end
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
1,903 lib/websitary/configuration.rb
@@ -0,0 +1,1903 @@
+# configuration.rb
+# @Last Change: 2009-05-25.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+
+
+# This class defines the scope in which profiles are evaluated. Most
+# of its methods are suitable for use in profiles.
+class Websitary::Configuration
+ # Hash (key = URL, value = Hash of options)
+ attr_accessor :urls
+ # Array of urls to be downloaded.
+ attr_reader :todo
+ # Array of downloaded urls.
+ attr_accessor :done
+ # The user configuration directory
+ attr_accessor :cfgdir
+ # What to do
+ attr_accessor :execute
+ # Global Options
+ attr_accessor :options
+ # Cached mtimes
+ attr_accessor :mtimes
+ # The name of the quicklist profile
+ attr_accessor :quicklist_profile
+ # attr_accessor :default_profiles
+ # attr_accessor :cmd_edit
+
+
+ def initialize(app, args=[])
+ @logger = Websitary::AppLog.new
+ $logger.debug "Configuration#initialize"
+ @app = app
+ @cfgdir = ENV['HOME'] ? File.join(ENV['HOME'], '.websitary') : '.'
+ [
+ ENV['USERPROFILE'] && File.join(ENV['USERPROFILE'], 'websitary'),
+ File.join(Config::CONFIG['sysconfdir'], 'websitary')
+ ].each do |dir|
+ if File.exists?(dir)
+ @cfgdir = dir
+ break
+ end
+ end
+
+ @cmd_edit = 'vi "%s"'
+ @execute = 'downdiff'
+ @quicklist_profile = 'quicklist'
+ @view = 'w3m "%s"'
+
+ @allow = {}
+ @default_options = {}
+ @default_profiles = [@quicklist_profile]
+ @done = []
+ @mtimes = Websitary::FileMTimes.new(self)
+ @options = {}
+ @outfile = {}
+ @profiles = []
+ @robots = {}
+ @todo = []
+ @exclude = [/^\s*(javascript|mailto):/]
+ @urlencmap = {}
+ @urls = {}
+
+ @suffix = {
+ 'text' => 'txt'
+ # 'rss' => 'xml'
+ }
+
+ migrate
+ initialize_options
+ profile 'config.rb'
+ parse_command_line_args(args)
+
+ @output_format ||= ['html']
+ @output_title = %{#{Websitary::APPNAME}: #{@profiles.join(", ")}}
+ end
+
+
+ def parse_command_line_args(args)
+ $logger.debug "parse_command_line_args: #{args}"
+ opts = OptionParser.new do |opts|
+ opts.banner = "Usage: #{Websitary::APPNAME} [OPTIONS] [PROFILES] > [OUT]"
+ opts.separator ''
+ opts.separator "#{Websitary::APPNAME} is a free software with ABSOLUTELY NO WARRANTY under"
+ opts.separator 'the terms of the GNU General Public License version 2 or newer.'
+ opts.separator ''
+
+ opts.separator 'General Options:'
+
+ opts.on('-c', '--cfg=DIR', String, 'Configuration directory') do |value|
+ @cfgdir = value
+ end
+
+ opts.on('-e', '--execute=COMMAND', String, 'Define what to do (default: downdiff)') do |value|
+ @execute = value
+ end
+
+ # opts.on('-E', '--edit=PROFILE', String, 'Edit a profile') do |value|
+ # edit_profile value
+ # exit 0
+ # end
+
+ opts.on('-f', '--output-format=FORMAT', 'Output format (html, text, rss)') do |value|
+ output_format(*value.split(/,/))
+ end
+
+ opts.on('--[no-]ignore-age', 'Ignore age limits') do |bool|
+ set :ignore_age => bool
+ end
+
+ opts.on('--log=DESTINATION', String, 'Log destination') do |value|
+ @logger = Websitary::AppLog.new(value != '-' && value)
+ end
+
+ opts.on('-o', '--output=FILENAME', String, 'Output') do |value|
+ output_file(value)
+ end
+
+ opts.on('-s', '--set=NAME=VAR', String, 'Set a default option') do |value|
+ key, val = value.split(/=/, 2)
+ set key.intern => eval(val)
+ end
+
+ opts.on('-t', '--timer=N', Numeric, 'Repeat every N seconds (never exit)') do |value|
+ global(:timer => value)
+ end
+
+ opts.on('-x', '--exclude=N', Regexp, 'Exclude URLs matching this pattern') do |value|
+ exclude(Regexp.new(value))
+ end
+
+ opts.separator ''
+ opts.separator "Available commands (default: #@execute):"
+ commands = @app.methods.map do |m|
+ mt = m.match(/^execute_(.*)$/)
+ mt && mt[1]
+ end
+ commands.compact!
+ commands.sort!
+ opts.separator commands.join(', ')
+
+ opts.separator ''
+ opts.separator 'Available profiles:'
+ opts.separator Dir[File.join(@cfgdir, '*.rb')].map {|f| File.basename(f, '.*')}.join(', ')
+
+ opts.separator ''
+ opts.separator 'Other Options:'
+
+ opts.on('--debug', 'Show debug messages') do |v|
+ $VERBOSE = $DEBUG = true
+ @logger.set_level(:debug)
+ end
+
+ opts.on('-q', '--quiet', 'Be mostly quiet') do |v|
+ @logger.set_level(:quiet)
+ end
+
+ opts.on('-v', '--verbose', 'Run verbosely') do |v|
+ $VERBOSE = true
+ @logger.set_level(:verbose)
+ end
+
+            opts.on('--version', 'Show version') do |v|
+ puts Websitary::VERSION
+ exit 1
+ end
+
+ opts.on_tail('-h', '--help', 'Show this message') do
+ puts opts
+ exit 1
+ end
+ end
+
+ @profiles = opts.parse!(args)
+ @profiles = @default_profiles if @profiles.empty?
+ cla_handler = "cmdline_arg_#{@execute}"
+ cla_handler = nil unless @app.respond_to?(cla_handler)
+ for pn in @profiles
+ if cla_handler
+ @app.send(cla_handler, self, pn)
+ else
+ profile pn
+ end
+ end
+
+ self
+ end
+
+
+ def url_set(url, items)
+ opts = @urls[url] ||= {}
+ opts.merge!(items)
+ end
+
+
+ # Retrieve an option for an url
+ # url:: String
+ # opt:: Symbol
+ def url_get(url, opt, default=nil)
+ opts = @urls[url]
+ unless opts
+ $logger.debug "Non-registered URL: #{url}"
+ return default
+ end
+ $logger.debug "get: opts=#{opts.inspect}"
+ case opt
+ when :diffprocess, :format
+ opt_ = opts.has_key?(opt) ? opt : :diff
+ else
+ opt_ = opt
+ end
+
+ $logger.debug "get: opt=#{opt} opt_=#{opt_}"
+ $logger.debug "get: #{opts[opt_]} #{opts[:use]}" if opts
+ if opts.has_key?(opt_)
+ val = opts[opt_]
+ elsif opts.has_key?(:use)
+ val = opts[:use]
+ else
+ val = nil
+ end
+
+ case val
+ when nil
+ when Symbol
+ $logger.debug "get: val=#{val}"
+ success, rv = opt_get(opt, val)
+ $logger.debug "get: #{success}, #{rv}"
+ if success
+ return rv
+ end
+ else
+ $logger.debug "get: return val=#{val}"
+ return val
+ end
+ unless default
+ success, default1 = opt_get(opt, :default)
+ default = default1 if success
+ end
+
+ $logger.debug "get: return default=#{default}"
+ return default
+ end
+
+
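+ # Resolve +val+ for option +opt+: Symbols are looked up via #opt_get,
+ # any other value is returned as is.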
+ def optval_get(opt, val, default=nil)
+ case val
+ when Symbol
+ ok, val = opt_get(opt, val)
+ if ok
+ val
+ else
+ default
+ end
+ else
+ val
+ end
+ end
+
+
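+ # Look up +val+ in the @options table for +opt+, following Symbol
+ # indirections. Returns [success, value].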
+ def opt_get(opt, val)
+ vals = @options[opt]
+ $logger.debug "val=#{val} vals=#{vals.inspect}"
+ if vals and vals.has_key?(val)
+ rv = vals[val]
+ $logger.debug "opt_get ok: #{opt} => #{rv.inspect}"
+ case rv
+ when Symbol
+ $logger.debug "opt_get re: #{rv}"
+ return opt_get(opt, rv)
+ else
+ $logger.debug "opt_get true, #{rv}"
+ return [true, rv]
+ end
+ else
+ $logger.debug "opt_get no: #{opt} => #{val.inspect}"
+ return [false, val]
+ end
+ end
+
+
+ # Configuration command:
+ # Set the default profiles
+ def default(*profile_names)
+ @default_profiles = profile_names
+ end
+
+
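+ # Configuration command:
+ # Set the name of the quicklist profile.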
+ def quicklist(profile_name)
+ @quicklist_profile = profile_name
+ end
+
+
+ # Configuration command:
+ # Load a profile
+ def profile(profile_name)
+ case profile_name
+ when '-'
+ readlines.map! {|l| l.chomp}.each {|url| source url}
+ when '__END__'
+ $logger.debug "Profile: __END__"
+ contents = DATA.read
+ return eval_profile(contents)
+ else
+ fn = profile_filename(profile_name)
+ if fn
+ $logger.debug "Profile: #{fn}"
+ contents = File.read(fn)
+ return eval_profile(contents, fn)
+ else
+ $logger.error "Unknown profile: #{profile_name}"
+ end
+ end
+ return false
+ end
+
+
+ # Define an options shortcut.
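+ #
+ # A hypothetical example (the shortcut name and values are made up):
+ #
+ # shortcut :mypage,
+ # :download => :openuri,
+ # :format => '<pre>%s</pre>',
+ # :delegate => :default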
+ def shortcut(symbol, args)
+ ak = args.keys
+ ok = @options.keys
+ dk = ok - ak
+
+ # :downloadprocess
+ if !ak.include?(:delegate) and
+ dk.any? {|e| [:download, :downloadformat, :diff, :format, :diffprocess].include?(e)}
+ $logger.warn "Shortcut #{symbol}: Undefined fields: #{dk.inspect}"
+ end
+
+ if ak.include?(:delegate)
+ dk.each do |field|
+ @options[field][symbol] = args[:delegate]
+ end
+ end
+
+ args.each do |field, val|
+ @options[field][symbol] = val unless field == :delegate
+ end
+ end
+
+
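+ # Queue +url+ for processing unless it is excluded.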
+ def to_do(url)
+ @todo << url unless is_excluded?(url)
+ end
+
+
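+ # Check whether +url+ matches one of the patterns in @exclude.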
+ def is_excluded?(url)
+ rv = @exclude.any? {|p| url =~ p}
+ $logger.debug "is_excluded: #{url}: #{rv}"
+ rv
+ end
+
+
+ # Set the output format.
+ def output_format(*format)
+ unless format.all? {|e| ['text', 'html', 'rss'].include?(e)}
+ $logger.fatal "Unknown output format: #{format}"
+ exit 5
+ end
+ @output_format = format
+ end
+
+
+ # Set the output file.
+ def output_file(filename, outformat=nil)
+ @outfile[outformat] = filename
+ end
+
+
+ # Configuration command:
+ # Set global options.
+ # type:: Symbol
+ # options:: Hash
+ def option(type, options)
+ $logger.info "option #{type}: #{options.inspect}"
+ o = @options[type]
+ if o
+ o.merge!(options)
+ else
+ $logger.error "Unknown option type: #{type} (#{options.inspect})"
+ end
+ end
+
+
+ # Set a global option.
+ def global(options)
+ options.each do |type, value|
+ @options[:global][type] = value
+ end
+ end
+
+
+ # Configuration command:
+ # Set the default value for source-options.
+ def set(options)
+ $logger.debug "set: #{options.inspect}"
+ @default_options.merge!(options)
+ end
+
+
+ # Configuration command:
+ # Unset a default source-option.
+ def unset(*options)
+ for option in options
+ @default_options.delete(option)
+ end
+ end
+
+
+ # Configuration command:
+ # Define a source.
+ # urls:: String
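+ #
+ # A minimal profile sketch using these configuration commands (the
+ # URL and values are placeholders):
+ #
+ # set :timeout => 60
+ # exclude 'mailto:', /\.(css|js)$/
+ # source "http://www.example.com/news"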
+ def source(urls, opts={})
+ urls.split("\n").flatten.compact.each do |url|
+ url_set(url, @default_options.dup.update(opts))
+ to_do url
+ end
+ end
+
+
+ # Configuration command:
+ # Set the default download processor. The block takes the
+ # downloaded text (STRING) as argument.
+ def downloadprocess(&block)
+ @options[:downloadprocess][:default] = block
+ end
+
+
+ # Configuration command:
+ # Set the default diff processor. The block takes the
+ # diff text (STRING) as argument.
+ def diffprocess(&block)
+ @options[:diff][:default] = block
+ end
+
+
+ # Configuration command:
+ # Set the editor.
+ def edit(cmd)
+ @cmd_edit = cmd
+ end
+
+
+ # Configuration command:
+ # Add URL-exclusion patterns (REGEXPs or STRINGs).
+ def exclude(*urls)
+ @exclude += urls.map do |url|
+ case url
+ when Regexp
+ url
+ when String
+ Regexp.new(Regexp.escape(url))
+ else
+ $logger.fatal "Must be regexp or string: #{url.inspect}"
+ exit 5
+ end
+ end
+ end
+
+
+ # Configuration command:
+ # Set the viewer.
+ def view(view)
+ @view = view
+ end
+
+
+ # Configuration command:
+ # Set the default diff program.
+ def diff(diff)
+ @options[:diff][:default] = diff
+ end
+
+
+ # Configuration command:
+ # Set the default downloader.
+ def download(download)
+ @options[:download][:default] = download
+ end
+
+
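+ # Convert +text+ from the URL's :iconv encoding to the globally
+ # configured :encoding, if an :iconv option is set for +url+.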
+ def format_text(url, text)
+ enc = url_get(url, :iconv)
+ if enc
+ denc = optval_get(:global, :encoding)
+ begin
+ require 'iconv'
+ text = Iconv.conv(denc, enc, text)
+ rescue Exception => e
+ $logger.error "IConv failed #{enc} => #{denc}: #{e}"
+ end
+ end
+ return text
+ end
+
+
+ # Format a diff according to URL's source options.
+ def format(url, difftext)
+ fmt = url_get(url, :format)
+ text = format_text(url, difftext)
+ eval_arg(fmt, [text], text)
+ end
+
+
+ # Apply some arguments to a format.
+ # format:: String or Proc
+ # args:: Array of Arguments
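+ #
+ # Hypothetical examples:
+ # eval_arg('<b>%s</b>', ['foo']) # => "<b>foo</b>"
+ # eval_arg(lambda {|s| s.upcase}, ['foo']) # => "FOO"
+ # eval_arg(nil, ['foo'], 'bar') # => "bar"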
+ def eval_arg(format, args, default=nil, &process_string)
+ case format
+ when nil
+ return default
+ when Proc
+ # $logger.debug "eval proc: #{format} #{args.inspect}" #DBG#
+ $logger.debug "eval proc: #{format}/#{args.size}"
+ return format.call(*args)
+ else
+ ca = format % args
+ # $logger.debug "eval string: #{ca}" #DBG#
+ if process_string
+ return process_string.call(ca)
+ else
+ return ca
+ end
+ end
+ end
+
+
+ # Apply the argument to cmd (a format String or a Proc). If a
+ # String, execute the command.
+ def call_cmd(cmd, cmdargs, args={})
+ default = args[:default]
+ url = args[:url]
+ timeout = url ? url_get(url, :timeout) : nil
+ if timeout
+ begin
+ Timeout::timeout(timeout) do
+ eval_arg(cmd, cmdargs, default) {|c| `#{c}`}
+ end
+ rescue Timeout::Error
+ $logger.error "Timeout #{timeout}: #{url}"
+ return default
+ end
+ else
+ eval_arg(cmd, cmdargs, default) {|c| `#{c}`}
+ end
+ end
+
+
+ # Generate & view the final output.
+ # difftext:: Hash
+ def show_output(difftext)
+ if difftext.empty?
+ msg = ['No news is good news']
+ msg << "try again in #{@app.format_tdiff(@app.tdiff_min)}" if @app.tdiff_min
+ $logger.warn msg.join('; ')
+ return 0
+ end
+
+ @output_format.each do |outformat|
+ meth = "get_output_#{outformat}"
+
+ unless respond_to?(meth)
+ $logger.fatal "Unknown output format: #{outformat}"
+ exit 5
+ end
+
+ out = send(meth, difftext)
+ if out
+ outfile = get_outfile(outformat)
+ case outfile
+ when '-'
+ puts out
+ else
+ write_file(outfile) {|io| io.puts out}
+ meth = "view_output_#{outformat}"
+ self.send(meth, outfile)
+ end
+ end
+ end
+ return 1
+ end
+
+
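+ # Render +difftext+ (a Hash of url => diff) as plain text.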
+ def get_output_text(difftext)
+ difftext.map do |url, difftext|
+ if difftext
+ difftext = html_to_text(difftext) if is_html?(difftext)
+ difftext.empty? ? nil : [
+ eval_arg(url_get(url, :rewrite_link, '%s'), [url]),
+ difftext_annotation(url),
+ nil,
+ difftext
+ ].join("\n")
+ end
+ end.compact.join("\n\n#{('-' * 68)}\n\n")
+ end
+
+
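+ # Render +difftext+ as an RSS feed. Requires the global option
+ # :rss => {:url => ...} to be set.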
+ def get_output_rss(difftext)
+ success, rss_url = opt_get(:rss, :url)
+ if success
+ success, rss_version = opt_get(:rss, :version)
+ # require "rss/#{rss_version}"
+
+ rss = RSS::Rss.new(rss_version)
+ chan = RSS::Rss::Channel.new
+ chan.title = @output_title
+ [:description, :copyright, :category, :language, :image, :webMaster, :pubDate].each do |field|
+ ok, val = opt_get(:rss, field)
+ chan.send(format_symbol(field, '%s='), val) if ok
+ end
+ chan.link = rss_url
+ rss.channel = chan
+
+ difftext.each do |url, text|
+ rss_format = url_get(url, :rss_format, 'plain_text')
+ text = strip_tags(text, :format => rss_format)
+ next if text.empty?
+
+ item = RSS::Rss::Channel::Item.new
+ item.date = Time.now
+ item.title = url_get(url, :title, File.basename(url))
+ item.link = eval_arg(url_get(url, :rewrite_link, '%s'), [url])
+ [:author, :date, :enclosure, :category, :pubDate].each do |field|
+ val = url_get(url, format_symbol(field, 'rss_%s'))
+ item.send(format_symbol(field, '%s='), val) if val
+ end
+
+ annotation = difftext_annotation(url)
+ annotation = "<pre>#{annotation}</pre>" if annotation
+ case rss_format
+ when 'plain_text'
+ item.description = %{#{annotation}<pre>#{text}</pre>}
+ else
+ item.description = %{#{annotation}\n#{text}}
+ end
+ chan.items << item
+ end
+
+ return rss.to_s
+
+ else
+
+ $logger.fatal "Global option :rss[:url] not defined."
+ exit 5
+
+ end
+ end
+
+
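+ # Render +difftext+ as an HTML page with a table of contents.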
+ def get_output_html(difftext)
+ difftext = difftext.map do |url, text|
+ tags = url_get(url, :strip_tags)
+ text = strip_tags(text, :tags => tags) if tags
+ text.empty? ? nil : [url, text]
+ end
+ difftext.compact!
+ sort_difftext!(difftext)
+
+ toc = difftext.map do |url, text|
+ ti = url_get(url, :title, File.basename(url))
+ tid = html_toc_id(url)
+ bid = html_body_id(url)
+ %{<li id="#{tid}" class="toc"><a class="toc" href="\##{bid}">#{ti}</a></li>}
+ end.join("\n")
+
+ idx = 0
+ cnt = difftext.map do |url, text|
+ idx += 1
+ ti = url_get(url, :title, File.basename(url))
+ bid = html_body_id(url)
+ if (rewrite = url_get(url, :rewrite_link))
+ urlr = eval_arg(rewrite, [url])
+ ext = ''
+ else
+ old = %{<a class="old" href="#{file_url(oldname(url))}">old</a>}
+ lst = %{<a class="latest" href="#{file_url(latestname(url))}">latest</a>}
+ ext = %{ (#{old}, #{lst})}
+ urlr = url
+ end
+ note = difftext_annotation(url)
+ onclick = optval_get(:global, :toggle_body) ? 'onclick="ToggleBody(this)"' : ''
+ <<HTML
+<div id="#{bid}" class="webpage" #{onclick}>
+<div class="count">
+#{idx}
+</div>
+<h1 class="diff">
+<a class="external" href="#{urlr}">#{format_text(url, ti)}</a>#{ext}
+</h1>
+<div id="#{bid}_body">
+<div class="annotation">
+#{note && CGI::escapeHTML(note)}
+</div>
+<div class="diff difftext">
+#{format(url, text)}
+</div>
+</div>
+</div>
+HTML
+ end.join(('<hr class="separator"/>') + "\n")
+
+ success, template = opt_get(:page, :format)
+ unless success
+ success, template = opt_get(:page, :simple)
+ end
+ return eval_arg(template, [@output_title, toc, cnt])
+ end
+
+
+ # Get the diff filename.
+ def diffname(url, ensure_dir=false)
+ encoded_filename('diff', url, ensure_dir, 'md5')
+ end
+
+
+ # Get the backup filename.
+ def oldname(url, ensure_dir=false, type=nil)
+ encoded_filename('old', url, ensure_dir, type)
+ end
+
+
+ # Get the filename for the freshly downloaded copy.
+ def latestname(url, ensure_dir=false, type=nil)
+ encoded_filename('latest', url, ensure_dir, type)
+ end
+
+
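+ # Map a cached +filename+ back to its URL (see @urlencmap).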
+ def url_from_filename(filename)
+ rv = @urlencmap[filename]
+ if rv
+ $logger.debug "Map filename: #{filename} -> #{rv}"
+ else
+ $logger.warn "Unmapped filename: #{filename}"
+ end
+ rv
+ end
+
+
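+ # Construct the cache filename for +url+ below +dir+. Falls back to
+ # an MD5-based name if the encoded filename is too long or otherwise
+ # unusable.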
+ def encoded_filename(dir, url, ensure_dir=false, type=nil)
+ type ||= url_get(url, :cachetype, 'tree')
+ $logger.debug "encoded_filename: type=#{type} url=#{url}"
+ rv = File.join(@cfgdir, dir, encoded_basename(url, type))
+ rd = File.dirname(rv)
+ $logger.debug "encoded_filename: rv0=#{rv}"
+ fm = optval_get(:global, :filename_size, 255)
+ rdok = !ensure_dir || @app.ensure_dir(rd, false)
+ if !rdok or rv.size > fm or File.directory?(rv)
+ # $logger.debug "Filename too long (:global=>:filename_size = #{fm}), try md5 encoded filename instead: #{url}"
+ $logger.info "Can't use filename, try 'md5' instead: #{url}"
+ rv = File.join(@cfgdir, dir, encoded_basename(url, :md5))
+ rd = File.dirname(rv)
+ end
+ @urlencmap[rv] = url
+ return rv
+ end
+
+
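+ # Dispatch to encoded_basename_TYPE (tree, flat, or md5).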
+ def encoded_basename(url, type='tree')
+ m = "encoded_basename_#{type}"
+ if respond_to?(m)
+ return send(m, url)
+ else
+ $logger.fatal "Unknown cache type: #{type}"
+ exit 5
+ end
+ end
+
+
+ def encoded_basename_tree(url)
+ ensure_filename(encode(url, '/'))
+ end
+
+
+ def encoded_basename_flat(url)
+ encode(url)
+ end
+
+
+ def encoded_basename_md5(url)
+ Digest::MD5.hexdigest(url)
+ end
+
+
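+ # Return the extension of the URL's path; nil if the URL cannot be
+ # parsed.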
+ def urlextname(url)
+ begin
+ return File.extname(URI.parse(url).path)
+ rescue Exception => e
+ end
+ end
+
+
+ # Guess path's dirname.
+ # foo/bar -> foo
+ # foo/bar.txt -> foo
+ # foo/bar/ -> foo/bar
+ def guess_dir(path)
+ path[-1..-1] == '/' ? path[0..-2] : File.dirname(path)
+ end
+
+
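+ # Determine (and create, if necessary) the directory where
+ # attachments for +url+ are saved.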
+ def save_dir(url, dir, title=nil)
+ case dir
+ when true
+ title = url_get(url, :title) || encode(title)
+ dir = File.join(@cfgdir, 'attachments', title)
+ when Proc
+ dir = dir.call(url)
+ end
+ @app.ensure_dir(dir) if dir
+ return dir
+ end
+
+
+ def clean_url(url)
+ url && url.strip
+ end
+
+
+ # Strip the url's last part (after #).
+ def canonic_url(url)
+ url.sub(/#.*$/, '')
+ end
+
+
+ def strip_tags_default
+ success, tags = opt_get(:strip_tags, :default)
+ tags.dup if success
+ end
+
+
+ def strip_tags(doc, args={})
+ tags = args[:tags] || strip_tags_default || []
+ case doc
+ when String
+ doc = Hpricot(doc)
+ end
+ tags.each do |tag|
+ doc.search(tag).remove
+ end
+ case args[:format]
+ when :hpricot
+ doc
+ else
+ doc.send("to_#{args[:format] || :html}")
+ end
+ end
+
+
+ # Check whether path is eligible on the basis of url or path0.
+ # This checks either for a :match option for url or the extensions
+ # of path0 and path.
+ def eligible_path?(url, path0, path)
+ rx = url_get(url, :match)
+ if rx
+ return path =~ rx
+ else
+ return File.extname(path0) == File.extname(path)
+ end
+ end
+
+
+ # Scan an hpricot document for hrefs and push them onto @todo unless
+ # they are already included.
+ def push_hrefs(url, hpricot, &condition)
+ begin
+ $logger.debug "push_hrefs: #{url}"
+ return if robots?(hpricot, 'nofollow') or is_excluded?(url)
+ depth = url_get(url, :depth)
+ return if depth and depth <= 0
+ uri0 = URI.parse(url)
+ # pn0 = Pathname.new(guess_dir(File.expand_path(uri0.path)))
+ pn0 = Pathname.new(guess_dir(uri0.path))
+ (hpricot / 'a').each do |a|
+ next if a['rel'] == 'nofollow'
+ href = clean_url(a['href'])
+ next if href.nil? or href == url or is_excluded?(href)
+ href = rewrite_href(href, url, uri0, pn0, true)
+ next if href.nil?
+ curl = canonic_url(href)
+ next if @done.include?(curl) or @todo.include?(curl)
+ # pn = Pathname.new(guess_dir(File.expand_path(uri.path)))
+ uri = URI.parse(href)
+ pn = Pathname.new(guess_dir(uri.path))
+ next unless condition.call(uri0, pn0, uri, pn)
+ next unless robots_allowed?(curl, uri)
+ opts = @urls[url].dup
+ # opts[:title] = File.basename(curl)
+ opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
+ opts[:depth] = depth - 1 if depth and depth >= 0
+ # opts[:sleep] = delay if delay
+ url_set(curl, opts)
+ to_do curl
+ end
+ rescue Exception => e
+ # $logger.error e #DBG#
+ $logger.error e.message
+ $logger.debug e.backtrace
+ end
+ end
+
+
+ # Rewrite urls in doc
+ # url:: String
+ # doc:: Hpricot document
+ def rewrite_urls(url, doc)
+ uri = URI.parse(url)
+ urd = guess_dir(uri.path)
+ (doc / 'a').each do |a|
+ href = clean_url(a['href'])
+ if is_excluded?(href)
+ comment_element(doc, a)
+ else
+ href = rewrite_href(href, url, uri, urd, true)
+ a['href'] = href if href
+ end
+ end
+ (doc / 'img').each do |a|
+ href = clean_url(a['src'])
+ if is_excluded?(href)
+ comment_element(doc, a)
+ else
+ href = rewrite_href(href, url, uri, urd, false)
+ a['src'] = href if href
+ end
+ end
+ doc
+ end
+