
Initial

0 parents commit 3e248db216d302a1b80c81744697afe7d21f5fbc Tom Link committed Jun 7, 2009
Showing with 3,899 additions and 0 deletions.
  1. +112 −0 History.txt
  2. +88 −0 Makefile
  3. +100 −0 Makefile.config
  4. +780 −0 README.txt
  5. +31 −0 Rakefile
  6. +43 −0 bin/websitary
  7. +30 −0 index.txt
  8. +555 −0 lib/websitary.rb
  9. +39 −0 lib/websitary/applog.rb
  10. +1,903 −0 lib/websitary/configuration.rb
  11. +58 −0 lib/websitary/filemtimes.rb
  12. +160 −0 lib/websitary/htmldiff.rb
112 History.txt
@@ -0,0 +1,112 @@
+= 0.6
+
+* RSS attachments: Source title is preferred to the channel's title.
+* body_html: If there is no body tag, use the document as is.
+* rss: also scan items without descriptions with :rss_find_enclosure
+
+= 0.5
+
+* mailto: and javascript: hrefs are now handled via the exclude option
+* rewrite absolute URLs sans host correctly
+* strip href and image src tags in order to prevent parser errors
+* some scaffolding for mechanize
+* global proxy option (currently only used for mechanize)
+* use -nolist for lynx
+* catch errors in Websitary::App#execute_downdiff
+* :rss_find_enclosure => LAMBDA: Extract the enclosure URL from the item
+ description
+* :rss_format_local_copy => STRING|BLOCK/2: Format the display of the
+ local copy.
+
+
+= 0.4
+
+* Sources may have a :timeout option.
+* exclude: Argument can be a string or a regexp.
+* htmldiff: :ignore option to exclude certain nodes from the diff.
+* Left-mouse clicks make items collapse/expand.
+* iconv: Support for converting encodings (requires the per-url iconv
+ option to be set).
+* exclude mailto urls.
+
+
+= 0.3
+
+* Renamed the global option :downloadhtml to :download_html.
+* The downloader for robots and rss enclosures should now be properly
+ configurable via the global options :download_robots and
+ :download_rss_enclosure (default: :openuri).
+* Respect rel="nofollow" on hyperreferences.
+* :wdays, :mdays didn't work.
+* --exclude command line options, exclude configuration command
+* Check for robots.txt-compliance after testing if the URL is
+ appropriate.
+* htmldiff.rb can now also highlight differences à la websec's webdiff.
+* configuration.rb: Ignore pubDate and certain other non-essential fields (tags
+ etc.) when constructing rss item IDs.
+
+
+= 0.2.1
+
+* Use URI.merge for constructing robots.txt uri.
+* Fixed minor show-stopper.
+
+
+= 0.2.0
+
+* Renamed the project from websitiary to websitary (without the
+ additional "i")
+* The default output filename is now constructed on basis of the profile
+ names joined with a comma.
+* Apply rewrite-rules to URLs in text output.
+* Set user-agent (:body_html)
+* Exit with 1 if differences were found
+* Command line options have slightly changed: -e now is the short form
+ for --execute
+* Commands that can be triggered by the -e command-line switch: downdiff
+ (default), configuration (list currently configured urls), latest
+ (show the current version of all urls), review (show the latest
+ report)
+* Protect against filenames being too long (max size can be configured
+ via: <tt>option :global, :filename_size => N</tt>)
+* Try to migrate local copies from the older flat to the new
+ hierarchical cache layout
+* Disabled -E/--edit, --review command-line options (use -e instead)
+* Try to maintain file atime/mtime when copying/moving files
+* FIX: Problem with loading robots.txt
+* Respect meta tag: robots="nofollow" (noindex is only checked in
+ conjunction with :download => :website*)
+* quicklist profile: register urls via the -eadd command-line switch;
+ see "Usage" for an example
+* Temporarily save diffs, so that we can reuse them if websitary exits
+ ungracefully.
+* Renamed :inner_html to :body_html
+* New shortcuts: :ftp, :ftp_recursive, :img, :rss, :opml (rudimentary)
+* New experimental commands: aggregate, show ... can be used to
+ periodically check for changes (e.g. of rss feeds) but to review these
+ changes only once in a while
+* Experimental --timer command-line option to re-run websitary every X
+ seconds.
+* The :rss differ has an option :rss_enclosure (true or directory name)
+ that will be used for automatically saving new enclosures (e.g. mp3
+ files in podcasts); in theory, one should thus be able to use
+ websitary as pod catcher etc.
+* Cache mtimes in order to reduce disk access.
+* Special profile "__END__": The section in the script file after the
+ __END__ line. This seems useful in some situations when employing a
+ single script.
+* Don't follow javascript links.
+* New date constraint for sources:
+ :daily => true ... Once a day
+ :days_of_month => BEGIN..END ... download URL only once per month
+ within this range of days.
+ :days_of_week => BEGIN..END ... download URL only once per week
+ within this range of days.
+ :months => N (calculated on basis of the calendar month, not the
+ number of days)
+
+
+== 0.1.0 / 2007-07-16
+
+* Initial release
+
88 Makefile
@@ -0,0 +1,88 @@
+include Makefile.config
+
+all: dbk html pdf tex text man
+
+dvi: ${BASE}.dvi
+dbk: ${BASE}.dbk
+html: ${BASE}.html
+pdf:
+ make DFLAGS="${DFLAGS} --pdf" "${BASE}.pdf"
+php: ${BASE}.php
+tex: ${BASE}.tex
+text: ${BASE}.text
+man: ${BASE}.1
+
+pdfclean: pdf cleantex
+dviclean: dvi cleantex
+
+makefile:
+ ${DEPLATE} -m makefile ${DFLAGS} ${BASE}.txt ${OTHER}
+
+website:
+ make prepare_website
+ ${DEPLATE} ${DFLAGS} ${WEBSITE_DFLAGS} ${FILE} ${OTHER}
+ echo ${WEBSITE_DIR}/${BASE}.html > .last_output
+
+%.html: %.txt
+ make prepare_html
+ ${DEPLATE} ${DFLAGS} ${HTML_DFLAGS} $< ${OTHER}
+ echo ${HTML_DIR}/$@ > .last_output
+
+%.text: %.txt
+ make prepare_text
+ ${DEPLATE} ${DFLAGS} ${TEXT_DFLAGS} $< ${OTHER}
+ echo ${TEXT_DIR}/$@ > .last_output
+
+%.php: %.txt
+ make prepare_php
+ ${DEPLATE} ${DFLAGS} ${PHP_DFLAGS} $< ${OTHER}
+ echo ${PHP_DIR}/$@ > .last_output
+
+%.dbk: %.txt
+ make prepare_dbk
+ ${DEPLATE} ${DFLAGS} ${DBK_DFLAGS} $< ${OTHER}
+ echo ${DBK_DIR}/$@ > .last_output
+
+%.tex: %.txt
+ make prepare_tex
+ ${DEPLATE} ${DFLAGS} ${TEX_DFLAGS} $< ${OTHER}
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.ref: %.txt
+ make prepare_ref
+ ${DEPLATE} ${DFLAGS} ${REF_DFLAGS} -o $@ $< ${OTHER}
+ echo ${REF_DIR}/$@ > .last_output
+
+%.dvi: %.tex
+ make prepare_dvi
+ cd ${TEX_DIR}; \
+ latex ${LATEX_FLAGS} $<; \
+ bibtex ${BIBTEX_FLAGS} $*; \
+ latex ${LATEX_FLAGS} $<; \
+ latex ${LATEX_FLAGS} $<;
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.pdf: %.tex
+ make prepare_pdf
+ cd ${TEX_DIR}; \
+ pdflatex ${PDFLATEX_FLAGS} $<; \
+ bibtex ${BIBTEX_FLAGS} $*; \
+ pdflatex ${PDFLATEX_FLAGS} $<; \
+ pdflatex ${PDFLATEX_FLAGS} $<
+ echo ${TEX_DIR}/$@ > .last_output
+
+%.1: %.ref
+ cd ${REF_DIR}; \
+ xmlto man $<
+ echo ${REF_DIR}/$@ > .last_output
+
+view: show
+show:
+ cygstart `cat .last_output`
+
+cleantex:
+ cd ${TEX_DIR}; \
+ rm -f *.toc *.aux *.log *.cp *.fn *.tp *.vr *.pg *.ky \
+ *.blg *.bbl *.out *.lot *.ind *.4tc *.4ct \
+ *.ilg *.idx *.idv *.lg *.xref || echo Nothing to be done!
+
100 Makefile.config
@@ -0,0 +1,100 @@
+FILE=index.txt
+BASE=$(basename ${FILE})
+OTHER=
+
+DEPLATE=deplate
+SCP=scp
+# SCP=pscp
+
+ # --css deplate \
+ # -t html-tabbar-top.html
+DFLAGS=-m code-gvim -D noSwallow=1
+
+HTML_DIR=html
+HTML_PLUS=-m html-obfuscate-email -m html-deplate-button \
+ --css deplate \
+ -t html-tabbar-right.html -m navbar-png
+HTML_DFLAGS=${HTML_PLUS} -d ${HTML_DIR} -f html
+
+WEBSITE_DIR=website
+WEBSITE_DFLAGS=${HTML_PLUS} -d ${WEBSITE_DIR} -f htmlsite
+
+PHP_DIR=php
+PHP_DFLAGS=${HTML_PLUS} -d ${PHP_DIR} -f phpsite -m html-obfuscate-email
+
+TEX_DIR=tex
+TEX_DFLAGS=-d ${TEX_DIR} -f latex
+
+TEXT_DIR=plain
+TEXT_DFLAGS=-d ${TEXT_DIR} -f plain
+
+DBK_DIR=docbook
+DBK_DFLAGS=-d ${DBK_DIR} -f dbk-article
+
+REF_DIR=${DBK_DIR}
+REF_DFLAGS=-d ${REF_DIR} -f dbk-ref
+
+LATEX_FLAGS=-interaction=nonstopmode
+PDFLATEX_FLAGS=${LATEX_FLAGS}
+BIBTEX_FLAGS=
+
+copy_images=if ls *.{jpg,jpeg,png,gif} 2> /dev/null; then cp -uv *.{jpg,jpeg,png,gif} $(1); fi
+copy_css=if ls *.css 2> /dev/null; then cp -vu *.css $(1); fi
+
+.PHONY: view show cleantex website makefile pdfclean dviclean dbk html pdf tex text man prepare_website prepare_html prepare_text prepare_php prepare_dbk prepare_tex prepare_ref prepare_dvi prepare_pdf
+
+default: website
+
+manual.pdf:
+ make FILE=websitary.txt pdf
+
+manual: website manual.pdf
+
+upload:
+ ${SCP} website/* tex/websitary.pdf tlink@rubyforge.org:/var/www/gforge-projects/websitiary/
+
+docs:
+ rake docs
+ ${SCP} docs/* tlink@rubyforge.org:/var/www/gforge-projects/websitiary/websitary/
+
+prepare_website:
+ mkdir -p ${WEBSITE_DIR}
+ $(call copy_images,"${WEBSITE_DIR}")
+ $(call copy_css,"${WEBSITE_DIR}")
+
+prepare_html:
+ mkdir -p ${HTML_DIR}
+ $(call copy_images,"${HTML_DIR}")
+ $(call copy_css,"${HTML_DIR}")
+
+prepare_text:
+ mkdir -p ${TEXT_DIR}
+
+prepare_php:
+ mkdir -p ${PHP_DIR}
+ $(call copy_images,"${PHP_DIR}")
+ $(call copy_css,"${PHP_DIR}")
+
+prepare_dbk:
+ mkdir -p ${DBK_DIR}
+ $(call copy_images,"${DBK_DIR}")
+
+prepare_tex:
+ mkdir -p ${TEX_DIR}
+ $(call copy_images,"${TEX_DIR}")
+
+prepare_ref:
+ mkdir -p ${REF_DIR}
+
+prepare_dvi:
+
+prepare_pdf:
+
+ctags:
+ rm tags
+ ctags -R bin lib
+
+files:
+ find bin lib -name "*.rb" > files.lst
+
+# vi: ft=make:tw=72:ts=4
780 README.txt
@@ -0,0 +1,780 @@
+websitary by Thomas Link
+http://rubyforge.org/projects/websitiary/
+
+This script monitors webpages, rss feeds, podcasts etc. and reports
+what's new. For many tasks, it reuses other programs to do the actual
+work. By default, it works on an ASCII basis, i.e. with the output of
+text-based webbrowsers. With the help of some friends, it works also
+with HTML.
+
+
+== DESCRIPTION:
+websitary (formerly known as websitiary with an extra "i") monitors
+webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+etc.) to do most of the actual work. By default, it works on an ASCII
+basis, i.e. with the output of text-based webbrowsers like w3m (or lynx,
+links etc.) as the output can easily be post-processed. It can also work
+with HTML and highlight new items. This script was originally planned as
+a ruby-based websec replacement.
+
+By default, this script will use w3m to dump HTML pages and then run
+diff over the current page and the previous backup. Some pages are
+better viewed with lynx or links. Downloaded documents (HTML or ASCII)
+can be post-processed (e.g., filtered through some ruby block that
+extracts elements via hpricot and the like). Please see the
+configuration options below to find out how to change this globally or
+for a single source.
+
+This user manual is also available as
+PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
+
+== FEATURES/PROBLEMS:
+* Handle webpages, rss feeds (optionally save attachments in podcasts
+ etc.)
+* Compare webpages with previous backups
+* Display differences between the current version and the backup
+* Provide hooks to post-process the downloaded documents and the diff
+* Display a one-page report summarizing all news
+* Automatically open the report in your favourite web-browser
+* Experimental: Download webpages at defined intervals and generate
+ incremental diffs.
+
+ISSUES, TODO:
+* With HTML output, changes are presented on one single page, which
+ means that pages with different encodings cause problems.
+* Improved support for robots.txt (test it)
+* The use of :website_below and :website is hardly tested (please
+ report errors).
+* download => :body_html tries to rewrite references (a, img) which may
+ fail on certain kind of urls (please report errors).
+* When using :body_html for download, it may happen that some
+ JavaScript code is stripped, which breaks some JavaScript-generated
+ links.
+* The --log command-line option will create a new instance of the
+ logger and thus reset any previous options related to the logging
+ level.
+
+NOTE: The script was previously called websitiary but was renamed (from
+0.2 on) to websitary (without the superfluous i).
+
+
+=== Caveat
+The script also includes experimental support for monitoring whole
+websites. Basically, this script supports robots.txt directives (see
+requirements) but this is hardly tested and may not work in some cases.
+
+While it may be okay to ignore robots.txt for your own websites, it is
+not for others'. Please make sure that the webpages you run this
+program on allow such a use. Some webpages disallow the use of any
+automatic downloader or offline reader in their user agreements.
+
+
+== SYNOPSIS:
+
+=== Usage
+Example:
+ # Run "profile"
+ websitary profile
+
+ # Edit "~/.websitary/profile.rb"
+ websitary --edit=profile
+
+ # View the latest report
+ websitary -ereview
+
+ # Refetch all sources regardless of :days and :hours restrictions
+ websitary -signore_age=true
+
+ # Create html and rss reports for my websites
+ websitary -fhtml,rss mysites
+
+  # Add a url to the quicklist profile
+ websitary -eadd http://www.example.com
+
+For example output see:
+* html[http://deplate.sourceforge.net/websitary.html]
+* rss[http://deplate.sourceforge.net/websitary.rss]
+* text[http://deplate.sourceforge.net/websitary.txt]
+
+
+=== Configuration
+Profiles are plain ruby files (with the '.rb' suffix) stored in
+~/.websitary/.
+
+The profile "config" (~/.websitary/config.rb) is always loaded if
+available.
+
+There are two special profile names:
+
+-::
+ Read URLs from STDIN.
+<tt>__END__</tt>::
+ Read the profile contained in the script source after the __END__
+ line.
+
+
+==== default 'PROFILE1', 'PROFILE2' ...
+Set the default profile(s). The default is: quicklist
+
+Example:
+ default 'my_profile'
+
+
+==== diff 'CMD "%s" "%s"'
+Use this shell command to make the diff.
+The two %s placeholders will be replaced with the old and the new
+filename.
+
+diff is used by default.
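+
+Example (a hypothetical alternative producing unified diffs):
+  diff 'diff -u "%s" "%s"'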
+
+
+==== diffprocess lambda {|text| ...}
+Use this ruby snippet to post-process the diff.
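
As a plain-Ruby illustration (the lambda name and the filtering rule are
invented for this example), a post-processor could drop unified-diff
hunk headers before the diff goes into the report:

```ruby
# Hypothetical post-processor: drop unified-diff hunk headers
# ("@@ -1 +1 @@" etc.) and keep all other lines unchanged.
drop_hunk_headers = lambda do |text|
  text.each_line.reject { |line| line =~ /\A@@/ }.join
end

# In a profile it would be registered as:
#   diffprocess drop_hunk_headers
```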
+
+
+==== download 'CMD "%s"'
+Use this shell command to download a page.
+%s will be replaced with the url.
+
+w3m is used by default.
+
+Example:
+ download 'lynx -dump "%s"'
+
+
+==== downloadprocess lambda {|text| ...}
+Use this ruby snippet to post-process what was downloaded. Return the
+new text.
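
A minimal plain-Ruby sketch (the name and the normalisation rule are
invented here) that strips trailing whitespace and drops blank lines
from the downloaded text before it is diffed:

```ruby
# Hypothetical post-processor: strip trailing whitespace from every
# line and drop blank lines by squeezing consecutive newlines.
normalize = lambda do |text|
  text.each_line.map { |line| line.rstrip }.join("\n").squeeze("\n") + "\n"
end

# In a profile:
#   downloadprocess normalize
```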
+
+
+==== edit 'CMD "%s"'
+Use this shell command to edit a profile. %s will be replaced with the filename.
+
+vi is used by default.
+
+Example:
+ edit 'gvim "%s"&'
+
+
+==== option TYPE, OPTION => VALUE
+Set a global option.
+
+TYPE can be one of:
+<tt>:diff</tt>::
+ Generate a diff
+<tt>:diffprocess</tt>::
+ Post-process a diff (if necessary)
+<tt>:format</tt>::
+ Format the diff for output
+<tt>:download</tt>::
+ Download webpages
+<tt>:downloadprocess</tt>::
+ Post-process downloaded webpages
+<tt>:page</tt>::
+ The :format field defines the format of the final report. Here VALUE
+ is a format string that takes 3 variables as arguments: report title,
+ toc, contents.
+<tt>:global</tt>::
+ Set a "global" option.
+
+OPTION is a symbol.
+
+VALUE is either a format string or a block of code (of class Proc).
+
+Example:
+ option :download, :foo => lambda {|url| get_url(url)}
+
+
+==== global OPTION => VALUE
+This is the same as <tt>option :global, OPTION => VALUE</tt>.
+
+Known global options:
+
+<tt>:canonic_filename => BLOCK(FILENAME)</tt>::
+ Rewrite filenames as they are stored in the mtimes register. This may
+ be useful if you want to use the same repository on several computers
+ with different locations etc.
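
For instance (a sketch; the path pattern is invented), filenames could
be rewritten relative to the home directory so that two machines can
share one register:

```ruby
# Hypothetical rewrite rule: make cache filenames independent of the
# user's home directory so the mtimes register can be shared between
# machines with different home paths.
canonic = lambda do |filename|
  filename.sub(%r{\A/home/[^/]+/}, '~/')
end

# In a profile:
#   option :global, :canonic_filename => canonic
```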
+
+<tt>:encoding => OUTPUT_DOCUMENT_ENCODING</tt>::
+ The default is 'ISO-8859-1'.
+
+<tt>:downloadhtml => SHORTCUT</tt>::
+ The default shortcut for downloading plain HTML.
+
+<tt>:file_url => BLOCK(FILENAME)</tt>::
+ Rewrite a filename as it is used for creating file urls to local
+ copies in the output. This may be useful if you want to use the same
+ repository on several computers with different locations etc.
+
+<tt>:filename_size => N</tt>::
+ The max filename size. If a filename becomes longer, md5 encoding will
+ be used for local copies in the cache.
+
+<tt>:toggle_body => BOOLEAN</tt>::
+ If true, make a news body collapsible on mouse-clicks (sort of).
+
+<tt>:proxy => STRING</tt>, <tt>:proxy => ARRAY</tt>::
+ The proxy (currently only supported by mechanize).
+
+<tt>:user_agent => STRING</tt>::
+ Set the user agent (only for certain queries).
+
+
+==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
+Set the output format.
+Format can be one of:
+
+* html
+* text, txt (this only works with text based downloaders)
+* rss (proof of concept only;
+ it requires :rss[:url] to be set to the url, where the rss feed will
+ be published, using the <tt>option :rss, :url => URL</tt>
+ configuration command; you either have to use a text-based downloader
+ or include <tt>:rss_format => 'html'</tt> in the url options)
+
+
+==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
+(Un)Set an option for the following source commands.
+
+Example:
+ set :download, :foo => lambda {|url| get_url(url)}
+ set :days => 7, :sort => true
+ unset :days, :sort
+
+
+==== source URL(S), [OPTIONS]
+Options
+
+<tt>:cols => FROM..TO</tt>::
+ Use only these columns from the output (used after applying the :lines
+ option)
+
+<tt>:depth => INTEGER</tt>::
+ In conjunction with a :website type of :download option, fetch urls up
+ to this depth.
+
+<tt>:diff => "CMD", :diff => SHORTCUT</tt>::
+ Use this command to make the diff for this page. Possible values for
+ SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
+ :wget, or :body_html), :websec_webdiff (use websec's webdiff tool),
+ :body_html, :website_below, :website and :openuri are synonyms for
+ :webdiff.
+ NOTE: Since version 0.3, :webdiff is mapped to websitary's own
+ htmldiff class (which can also be used as stand-alone script). Before
+ 0.3, websitary used websec's webdiff script, which is now mapped to
+ :websec_webdiff.
+
+<tt>:diffprocess => lambda {|text| ...}</tt>::
+ Use this ruby snippet to post-process this diff
+
+<tt>:download => "CMD", :download => SHORTCUT</tt>::
+ Use this command to download this page. For possible values for
+ SHORTCUT see the section on shortcuts below.
+
+<tt>:downloadprocess => lambda {|text| ...}</tt>::
+ Use this ruby snippet to post-process what was downloaded. This is the
+ place where, e.g., hpricot can be used to extract certain elements
+ from the HTML code.
+ Example:
+ lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+<tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
+ The format string for the diff text. The default (the :diff shortcut)
+ wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
+ :website, and :openuri will simply add a newline character.
+
+<tt>:iconv => ENCODING</tt>::
+ If set, use iconv to convert the page body into the summary's document
+ encoding (see the 'global' section). Websitary currently isn't able to
+ automatically determine and convert encodings.
+
+<tt>:timeout => SECONDS</tt>::
+ When using openuri, download the page with a timeout.
+
+<tt>:hours => HOURS, :days => DAYS</tt>::
+ Don't download the file unless it's older than that
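+ Example (hypothetical url):
+  source 'http://www.example.com/news.html', :hours => 6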
+
+<tt>:days_of_month => DAY..DAY, :mdays => DAY..DAY</tt>::
+ Download only once per month within a certain range of days (e.g.,
+ 15..31 ... Check once after the 15th). The argument can also be an
+ array (e.g., [1, 15]) or an integer.
+
+<tt>:days_of_week => DAY..DAY, :wdays => DAY..DAY</tt>::
+ Download only once per week within a certain range of days (e.g., 1..2
+ ... Check once on monday or tuesday; sunday = 0). The argument can
+ also be an array (e.g., [1, 2]) or an integer.
+
+<tt>:daily => true</tt>::
+ Download only once a day.
+
+<tt>:ignore_age => true</tt>::
+ Ignore any :days and :hours settings. This is useful in some cases
+ when set on the command line.
+
+<tt>:lines => FROM..TO</tt>::
+ Use only these lines from the output
+
+<tt>:match => REGEXP</tt>::
+ When recursively walking a website, follow only links that match this
+ regexp.
+
+<tt>:rss_rewrite_enclosed_urls => true</tt>::
+ If true, replace urls in the rss feed item description pointing to the
+ enclosure with a file url pointing to the local copy
+
+<tt>:rss_enclosure => true|"DIRECTORY"</tt>::
+ If true, save rss feed enclosures in
+ "~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
+ destination directory. Only enclosures of new items will be saved --
+ i.e. when downloading a feed for the first time, no enclosures will be
+ saved.
+
+<tt>:rss_find_enclosure => BLOCK</tt>::
+ Certain RSS-feeds embed enclosures in the description. Use this option
+ to scan the description (a Hpricot document) for a URL that is then saved
+ as enclosure if the :rss_enclosure option is set.
+ Example:
+ source 'http://www.example.com/rss',
+ :title => 'Example',
+ :use => :rss, :rss_enclosure => true,
+ :rss_find_enclosure => lambda {|item, doc| (doc / 'img').map {|e| e['src']}[0]}
+
+<tt>:rss_format (default: "plain_text")</tt>::
+ When output format is :rss, create rss item descriptions as plain text.
+
+<tt>:rss_format_local_copy => FORMAT_STRING | BLOCK</tt>::
+ By default, a hypertext reference to the local copy of an RSS
+ enclosure is added to the entry. Sometimes you may want to display
+ something inline (e.g. an image). You can then use this option to
+ define a format string (one field = the local copy's file url).
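+ Example (a hypothetical feed; shows the saved image enclosure inline
+ instead of as a link -- the single field receives the local copy's
+ file url):
+   source 'http://www.example.com/photos.xml',
+     :title => 'Photos',
+     :use => :rss, :rss_enclosure => true,
+     :rss_format_local_copy => '<img src="%s" />'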
+
+<tt>:show_initial => true</tt>::
+ Include initial copies in the report (may not always work properly).
+ This can also be set as a global option.
+
+<tt>:sleep => SECS</tt>::
+ Wait SECS seconds (float or integer) before downloading the page.
+
+<tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
+ Sort lines in output
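+ Example (hypothetical; sort lines case-insensitively):
+   source 'http://www.example.com', :use => :lynx,
+     :sort => lambda {|a, b| a.downcase <=> b.downcase}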
+
+<tt>:strip => true</tt>::
+ Strip empty lines
+
+<tt>:title => "TEXT"</tt>::
+ Display TEXT instead of URL
+
+<tt>:use => SYMBOL</tt>::
+ Use SYMBOL for any other option. I.e. <tt>:download => :body_html
+ :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
+ :body_html</tt> (because for :diff :body_html is a synonym for
+ :webdiff).
+
+The order of age constraints is:
+:hours > :daily > :wdays > :mdays > :days > :months.
+I.e. if :wdays is set, :mdays, :days, or :months are ignored.
+
+
+==== view 'CMD "%s"'
+Use this shell command to view the output (usually a HTML file).
+%s will be replaced with the filename.
+
+w3m is used by default.
+
+Example:
+ view 'gnome-open "%s"' # Gnome Desktop
+ view 'kfmclient "%s"' # KDE
+ view 'cygstart "%s"' # Cygwin
+ view 'start "%s"' # Windows
+ view 'firefox "%s"'
+
+
+=== Shortcuts for use with :use, :download and other options
+<tt>:w3m</tt>::
+ Use w3m for downloading the source. Use diff for generating diffs.
+
+<tt>:lynx</tt>::
+ Use lynx for downloading the source. Use diff for generating diffs.
+ Lynx doesn't try to recreate the layout of a page like w3m or links
+ do. As a result the output IMHO sometimes deviates from the original
+ design but is better suited for being post-processed in some
+ situations.
+
+<tt>:links</tt>::
+ Use links for downloading the source. Use diff for generating diffs.
+
+<tt>:curl</tt>::
+ Use curl for downloading the source. Use webdiff for generating diffs.
+
+<tt>:wget</tt>::
+ Use wget for downloading the source. Use webdiff for generating diffs.
+
+<tt>:openuri</tt>::
+ Use open-uri for downloading the source. Use webdiff for generating
+ diffs. This doesn't handle cookies and the like.
+
+<tt>:mechanize</tt>::
+ Use mechanize (must be installed) for downloading the source. Use
+ webdiff for generating diffs. This calls the URL's :mechanize property
+ (a lambda that takes 3 arguments: URL, agent, page => HTML as string)
+ to post-process the page (or if not available, use the page body's
+ HTML).
+
+<tt>:text</tt>::
+ This requires hpricot to be installed. Use open-uri for downloading
+ and hpricot for converting HTML to plain text. This still requires
+ diff as external helper.
+
+<tt>:body_html</tt>::
+ This requires hpricot to be installed. Use open-uri for downloading
+ the source, use only the body. Use webdiff for generating diffs. Try
+ to rewrite references (a, img) so that they point to the webpage. By
+ default, this will also strip tags like script, form, object ...
+
+<tt>:website</tt>::
+ Use :body_html to download the source. Follow all links referring to
+ the same host with the same file suffix. Use webdiff for generating
+ diffs.
+
+<tt>:website_below</tt>::
+ Use :body_html to download the source. Follow all links referring to
+ the same host and a file below the top directory with the same file
+ suffix. Use webdiff for generating diffs.
+
+<tt>:website_txt</tt>::
+ Use :website to download the source but convert the output to plain
+ text.
+
+<tt>:website_txt_below</tt>::
+ Use :website_below to download the source but convert the output to
+ plain text.
+
+<tt>:rss</tt>::
+ Download an rss feed, show changed items.
+
+<tt>:opml</tt>::
+ Experimental. Download the rss feeds registered in opml. No support
+ for atom yet.
+
+<tt>:img</tt>::
+ Download an image and display it in the output if it has changed
+ (according to diff). You can use hpricot to extract an image from an
+ HTML source (see the daily-image example in the example configuration
+ below).
+
+Any shortcuts relying on :body_html will also try to rewrite any
+references so that the links point to the webpage.
+
+
+
+=== Example configuration file for demonstration purposes
+
+ # Daily
+ set :days => 1
+
+ # Use lynx instead of the default downloader (w3m).
+ source 'http://www.example.com', :days => 7, :download => :lynx
+
+ # Use the HTML body and process via webdiff.
+ source 'http://www.example.com', :use => :body_html,
+ :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+ # Download a podcast
+ source 'http://www.example.com/podcast.xml', :title => 'Podcast',
+ :use => :rss,
+ :rss_enclosure => '/home/me/podcasts/example'
+
+ # Check a rss feed.
+ source 'http://www.example.com/news.xml', :title => 'News', :use => :rss
+
+ # Get rss feed info from an opml file (EXPERIMENTAL).
+ # @cfgdir is most likely '~/.websitary'.
+ source File.join(@cfgdir, 'news.opml'), :use => :opml
+
+
+ # Weekly
+ set :days => 7
+
+ # Consider the page body only from the 10th line downwards.
+ source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
+
+
+ # Bi-weekly
+ set :days => 14
+
+ # Use these urls with the default options.
+ source <<URLS
+ http://www.example.com
+ http://www.example.com/page.html
+ URLS
+
+  # Make HTML diffs and highlight occurrences of a word
+ source 'http://www.example.com',
+ :title => 'Example',
+ :use => :body_html,
+ :diffprocess => highlighter(/word/i)
+
+ # Download the whole website below this path (only pages with
+ # html-suffix), wait 30 secs between downloads.
+ # Download only php and html pages
+ # Follow links 2 levels deep
+ source 'http://www.example.com/foo/bar.html',
+ :title => 'Example -- Bar',
+ :use => :website_below, :sleep => 30,
+ :match => /\.(php|html)\b/, :depth => 2
+
+ # Download images from some kind of daily-image site (check the user
+ # agreement first, if this is allowed). This may require some ruby
+ # hacking in order to extract the right url.
+ source 'http://www.example.com/daily_image/', :title => 'Daily Image',
+ :use => :img,
+ :download => lambda {|url|
+ rv = nil
+ # Read the HTML.
+ html = open(url) {|io| io.read}
+ # This check is probably unnecessary as the failure to read
+ # the HTML document would most likely result in an
+ # exception.
+ if html
+ # Parse the HTML document.
+ doc = Hpricot(html)
+ # The following could actually be simplified using xpath
+ # or css search expressions. This isn't the most elegant
+ # solution but it works with any value of ALT.
+ # This downloads the image <img src="..." alt="Current Image">
+ # Check all img tags in the HTML document.
+ for e in doc.search(%{//img})
+ # Is this the image we're looking for?
+ if e['alt'] == "Current Image"
+ # Make relative urls absolute
+ img = rewrite_href(e['src'], url)
+ # Get the actual image data
+ rv = open(img, 'rb') {|io| io.read}
+ # Exit the for loop
+ break
+ end
+ end
+ rv
+ end
+ }
+
+
+ unset :days
+
+
+
+=== Commands for use with the -e command-line option
+Most of these commands require you to name a profile on the command
+line. You can define default profiles with the "default" configuration
+command.
+
+If no command is given, "downdiff" is executed.
+
+add::
+ Add the URLs given on the command line to the quicklist profile.
+ ATTENTION: The following arguments on the command line are URLs, not
+ profile names.
+
+aggregate::
+ Retrieve information and save changes for later review.
+
+configuration::
+ Show the fully qualified configuration of each source.
+
+downdiff::
+ Download and show differences (DEFAULT)
+
+edit::
+ Edit the profile given on the command line (use vi by default)
+
+latest::
+ Show the latest copies of the sources from the profiles given
+ on the command line.
+
+ls::
+ List number of aggregated diffs.
+
+rebuild::
+ Rebuild the latest report.
+
+review::
+ Review the latest report (just show it with the browser)
+
+show::
+ Show previously aggregated items. A typical use would be to
+ periodically run in the background a command like
+ websitary -eaggregate newsfeeds
+ and then
+ websitary -eshow newsfeeds
+ to review the changes.
+
+unroll::
+ Undo the latest fetch.
+
+
+
+== TIPS:
+=== Ruby
+The profiles are regular ruby sources that are evaluated in the context
+of the configuration object (Websitary::Configuration). Find out more
+about ruby at:
+* http://www.ruby-lang.org/en/documentation/
+* http://www.ruby-doc.org/docs/ProgrammingRuby/ (especially
+ the
+ language[http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html]
+ chapter)
+
+
+=== Cygwin
+Mixing native Windows apps and cygwin apps can cause problems. The
+following settings (e.g. in ~/.websitary/config.rb) can be used to use
+a native Windows editor and browser:
+
+ # Use the default Windows programs (as if double-clicked)
+ view '/usr/bin/cygstart "%s"'
+
+ # Translate the profile filename and edit it with a native Windows editor
+ edit 'notepad.exe $(cygpath -w -- "%s")'
+
+ # Rewrite cygwin filenames for use with a native Windows browser
+ option :global, :file_url => lambda {|f| f.sub(/\/cygdrive\/.+?\/.websitary\//, '')}
+
+
+=== Windows
+Backslashes usually have to be escaped by backslashes -- or use slashes.
+I.e. instead of 'c:\foo\bar' write either 'c:\\foo\\bar' or
+'c:/foo/bar'.
+
+
+== REQUIREMENTS:
+websitary is a ruby-based application. You thus need a ruby
+interpreter.
+
+It depends on how you use websitary whether you actually need the
+following libraries and applications.
+
+By default this script expects the following applications to be
+present:
+
+* diff
+* vi (or some other editor)
+
+and one of:
+
+* w3m[http://w3m.sourceforge.net/] (default)
+* lynx[http://lynx.isc.org/]
+* links[http://links.twibright.com/]
+
+The use of :websec_webdiff as :diff application requires
+websec[http://baruch.ev-en.org/proj/websec/] (or at
+Savannah[http://savannah.nongnu.org/projects/websec/]) to be installed.
+By default, websitary uses its own htmldiff class/script, which is less
+well tested and may return inferior results in comparison with websec's
+webdiff. In conjunction with :body_html, :openuri, or :curl, this will
+give you colored HTML diffs.
+
+For downloading HTML, you need one of these:
+
+* open-uri (should be part of ruby)
+* hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
+ :body_html, :website, and :website_below)
+* curl[http://curl.haxx.se/]
+* wget[http://www.gnu.org/software/wget/]
+
+The following ruby libraries are needed in conjunction with :body_html
+and :website related shortcuts:
+
+* hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
+ only the body etc.)
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+ for parsing robots.txt
+
+I would personally suggest the following setup:
+
+* w3m[http://w3m.sourceforge.net/]
+* hpricot[http://code.whytheluckystiff.net/hpricot]
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+
+
+== INSTALL:
+=== Use rubygems
+Run
+
+ gem install websitary
+
+This will download the package and install it.
+
+
+=== Use the zip
+The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
+setup.rb that does the work. Run
+
+ ruby setup.rb
+
+
+=== Initial Configuration
+Please check the requirements section above and get the extra libraries
+needed:
+* hpricot
+* robot_rules.rb
+
+These can be installed as follows:
+
+ # Install hpricot
+ gem install hpricot
+
+ # Install robot_rules.rb
+ wget http://www.rubyquiz.com/quiz64_sols.zip
+ # Check the correct path to site_ruby first!
+ unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
+ rm quiz64_sols.zip
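+To find the right site_ruby directory, you can ask ruby itself (on
+newer ruby versions the constant is named RbConfig; older 1.8 rubies
+also accept Config, as used elsewhere in this document):

```ruby
require 'rbconfig'

# Print the directory where site-specific libraries such as
# robot_rules.rb are expected to live.
puts RbConfig::CONFIG['sitelibdir']
```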
+
+You might then want to create a profile ~/.websitary/config.rb that is
+loaded on every run. In this profile you could set the default output
+viewer and profile editor, as well as a default profile.
+
+Example:
+
+ # Load standard.rb if no profile is given on the command line.
+ default 'standard'
+
+ # Use cygwin's cygstart to view the output with the default HTML
+ # viewer
+ view '/usr/bin/cygstart "%s"'
+
+ # Use Windows gvim from cygwin ruby which is why we convert the path
+ # first
+ edit 'gvim $(cygpath -w -- "%s")'
+
+Where these configuration files reside may differ. If the environment
+variable $HOME is defined, the default is $HOME/.websitary/ unless one
+of the following directories exists, in which case that directory is
+used instead:
+
+* $USERPROFILE/websitary (on Windows)
+* SYSCONFDIR/websitary (where SYSCONFDIR usually is /etc but you can
+ run ruby to find out more:
+ <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
+
+If neither directory exists and no $HOME variable is defined, the
+current directory will be used.
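+The lookup order described above can be sketched roughly as follows
+(a simplified illustration only; the helper name config_dir is made
+up and websitary's actual implementation may differ):

```ruby
# Simplified sketch of the configuration-directory lookup. Not
# websitary's actual code; config_dir is a hypothetical helper.
def config_dir(env = ENV, sysconfdir = '/etc')
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitary')
  existing = candidates.find {|dir| File.directory?(dir)}
  return existing if existing
  return File.join(env['HOME'], '.websitary') if env['HOME']
  Dir.pwd   # neither directory exists and $HOME is undefined
end
```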
+
+Now check out the configuration commands in the Synopsis section.
+
+
+== LICENSE:
+websitary Webpage Monitor
+Copyright (C) 2007-2008 Thomas Link
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+USA
+
31 Rakefile
@@ -0,0 +1,31 @@
+# -*- ruby -*-
+
+require 'rubygems'
+require 'hoe'
+load './lib/websitary.rb'
+
+Hoe.new('websitary', Websitary::VERSION) do |p|
+ p.rubyforge_name = 'websitiary'
+ p.author = 'Tom Link'
+ p.email = 'micathom at gmail com'
+ p.summary = 'A unified website news, rss feed, podcast monitor'
+ p.description = p.paragraphs_of('README.txt', 2..5).join("\n\n")
+ p.url = p.paragraphs_of('README.txt', 0).first.split(/\n/)[1..-1]
+ p.changes = p.paragraphs_of('History.txt', 0..1).join("\n\n")
+ p.extra_deps << 'hpricot'
+ # p.need_tgz = false
+ p.need_zip = true
+end
+
+require 'rtagstask'
+RTagsTask.new
+
+task :ctags do
+ puts `ctags --extra=+q --fields=+i+S -R bin lib`
+end
+
+task :files do
+ puts `find bin lib -name "*.rb" > files.lst`
+end
+
+# vim: syntax=Ruby
43 bin/websitary
@@ -0,0 +1,43 @@
+#! /usr/bin/env ruby
+# websitary.rb -- The website news, rss feed, podcast catching monitor
+# @Last Change: 2008-02-12.
+# Author:: Thomas Link (micathom at gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-06-09.
+
+
+require 'websitary'
+
+
+if __FILE__ == $0
+ w = Websitary::App.new(ARGV)
+ t = w.configuration.optval_get(:global, :timer)
+ if t
+ exit_code = 0
+ while exit_code <= 1
+ exit_code = Websitary::App.new(ARGV).process
+ case t
+ when Numeric
+ $logger.info "Sleep: #{t}s"
+ sleep t
+ when Proc
+ t.call
+ else
+ $logger.fatal "Malformed timer: #{t}"
+ exit_code = 5
+ break
+ end
+ end
+ else
+ exit_code = w.process
+ end
+ exit exit_code
+ # sleep 5
+end
+
+
+
+# vi: ft=ruby:tw=72:ts=2:sw=4
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
30 index.txt
@@ -0,0 +1,30 @@
+% #VAR: css=tabbar-top.css|screen, +serif.css
+% #VAR: tabBarPos=top tabEqualWidths! noTabBarButtons!
+
+#VAR: css=tabbar-right.css|screen, article.css|print, +serif.css
+#VAR: autoindex! buttonsColour=blue buttonsHighlight!
+#VAR: encoding=latin-1
+
+#VAR: tabBarHomeName=OVERVIEW:
+#VAR: headings=plain autoFileNames!
+#VAR: urlIcon=remote.png mailtoIcon=mailto.png markerInFrontOfURL!
+#VAR: baseUrl=http://websitiary.rubyforge.org/ baseUrlStripDir=1
+#VAR: levelshift=-1 codeSyntax=ruby codeStyle=tomacs
+#Var id=tabBar <<--
+[auto]
+API: | http://websitiary.rubyforge.org/websitary/
+--
+
+#TITLE: websitary
+#AUTHOR: Thomas Link
+% #DATE: today
+#MAKETITLE
+#LIST plain! max=2: contents
+
+#INC inputFormat=rdoc: README.txt
+
+
+% 2007-09-01; @Last Change: 2007-09-16.
+% vi: ft=viki:tw=72:ts=4
+% Local Variables:
+% End:
555 lib/websitary.rb
@@ -0,0 +1,555 @@
+# websitary.rb
+# @Last Change: 2008-03-11.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+
+require 'cgi'
+require 'digest/md5'
+# require 'ftools'
+require 'fileutils'
+require 'net/ftp'
+require 'optparse'
+require 'pathname'
+require 'rbconfig'
+require 'uri'
+require 'open-uri'
+require 'timeout'
+require 'yaml'
+require 'rss'
+
+['hpricot', 'robot_rules'].each do |f|
+ begin
+ require f
+ rescue Exception => e
+ $stderr.puts <<EOT
+#{e.message}
+Library could not be loaded: #{f}
+Please see the requirements section at: http://websitiary.rubyforge.org
+EOT
+ end
+end
+
+
+module Websitary
+ APPNAME = 'websitary'
+ VERSION = '0.5'
+ REVISION = '2476'
+end
+
+require 'websitary/applog'
+require 'websitary/filemtimes'
+require 'websitary/configuration'
+require 'websitary/htmldiff'
+
+
+# Basic usage:
+# Websitary::App.new(ARGV).process
+class Websitary::App
+ MINUTE_SECS = 60
+ HOUR_SECS = MINUTE_SECS * 60
+ DAY_SECS = HOUR_SECS * 24
+
+
+ # Hash: The output of the diff commands for each url.
+ attr_reader :difftext
+
+ # The configurator
+ attr_reader :configuration
+
+ # Secs until next update.
+ attr_reader :tdiff_min
+
+
+ # args:: Array of command-line (like) arguments.
+ def initialize(args=[])
+ @configuration = Websitary::Configuration.new(self, args)
+ @difftext = {}
+ @tdiff_min = nil
+
+ ensure_dir(@configuration.cfgdir)
+ css = File.join(@configuration.cfgdir, 'websitary.css')
+ unless File.exists?(css)
+ $logger.info "Copying default css file: #{css}"
+ @configuration.write_file(css, 'w') do |io|
+ io.puts @configuration.opt_get(:page, :css)
+ end
+ end
+ end
+
+
+ # Run the command stored in @execute.
+ def process
+ begin
+ m = "execute_#{@configuration.execute}"
+ if respond_to?(m)
+ exit_code = send(m)
+ else
+ $logger.fatal "Unknown command: #{@configuration.execute}"
+ exit_code = 5
+ end
+ ensure
+ @configuration.mtimes.swap_out
+ end
+ return exit_code
+ end
+
+
+ # Show the currently configured URLs
+ def execute_configuration
+ keys = @configuration.options.keys
+ urls = @configuration.todo
+ # urls = @configuration.todo..sort {|a,b| @configuration.url_get(a, :title, a) <=> @configuration.url_get(b, :title, b)}
+ urls.each_with_index do |url, i|
+ data = @configuration.urls[url]
+ text = [
+ "<b>URL</b><br/>#{url}<br/>",
+ "<b>current</b><br/>#{CGI.escapeHTML(@configuration.latestname(url, true))}<br/>",
+ "<b>backup</b><br/>#{CGI.escapeHTML(@configuration.oldname(url, true))}<br/>",
+ *((data.keys | keys).map do |k|
+ v = @configuration.url_get(url, k).inspect
+ "<b>:#{k}</b><br/>#{CGI.escapeHTML(v)}<br/>"
+ end)
+ ]
+ accumulate(url, text.join("<br/>"))
+ end
+ return show
+ end
+
+
+ def cmdline_arg_add(configuration, url)
+ configuration.to_do url
+ end
+
+
+ def execute_add
+ if @configuration.quicklist_profile
+ quicklist = @configuration.profile_filename(@configuration.quicklist_profile, false)
+ $logger.info "Use quicklist file: #{quicklist}"
+ if quicklist
+ @configuration.write_file(quicklist, 'a') do |io|
+ @configuration.todo.each do |url|
+ io.puts %{source #{url.inspect}}
+ end
+ end
+ return 0
+ end
+ end
+ $logger.fatal 'No valid quick-list profile defined'
+ exit 5
+ end
+
+
+ # Restore previous backups
+ def execute_unroll
+ @configuration.todo.each do |url|
+ latest = @configuration.latestname(url, true)
+ backup = @configuration.oldname(url, true)
+ if File.exist?(backup)
+ $logger.warn "Restore: #{url}"
+ $logger.debug "Copy: #{backup} => #{latest}"
+ copy(backup, latest)
+ end
+ end
+ return 0
+ end
+
+
+ # Edit currently chosen profiles
+ def execute_edit
+ @configuration.edit_profile
+ exit 0
+ end
+
+
+ # Show the latest report
+ def execute_review
+ @configuration.view_output
+ 0
+ end
+
+
+ # Show the current version of all urls
+ def execute_latest
+ @configuration.todo.each do |url|
+ latest = @configuration.latestname(url)
+ text = File.read(latest)
+ accumulate(url, text)
+ end
+ return show
+ end
+
+
+ # Rebuild the report from the already downloaded copies.
+ def execute_rebuild
+ execute_downdiff(true, true)
+ end
+
+
+ # Aggregate data for later review (see #execute_show)
+ def execute_aggregate
+ rv = execute_downdiff(false) do |url, difftext, opts|
+ if difftext and !difftext.empty?
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ aggrext = Digest::MD5.hexdigest(Time.now.to_s)
+ aggrfile = [aggrbase, aggrext].join('_')
+ @configuration.write_file(aggrfile) {|io| io.puts difftext}
+ end
+ end
+ clean_diffs
+ rv
+ end
+
+
+ def execute_ls
+ rv = 0
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ name = @configuration.url_get(url, :title, url)
+ $logger.debug "Source: #{name}"
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ aggrfiles = Dir["#{aggrbase}_*"]
+ aggrn = aggrfiles.size
+ if aggrn > 0
+ puts "%3d - %s" % [aggrn, name]
+ rv = 1
+ end
+ end
+ rv
+ end
+
+
+ # Show data collected by #execute_aggregate
+ def execute_show
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
+ aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+ difftext = []
+ aggrfiles = Dir["#{aggrbase}_*"]
+ aggrfiles.each do |file|
+ difftext << File.read(file)
+ end
+ difftext.compact!
+ difftext.delete('')
+ unless difftext.empty?
+ joindiffs = @configuration.url_get(url, :joindiffs, lambda {|t| t.join("\n")})
+ difftext = @configuration.call_cmd(joindiffs, [difftext], :url => url) if joindiffs
+ accumulate(url, difftext, opts)
+ end
+ aggrfiles.each do |file|
+ File.delete(file)
+ end
+ end
+ show
+ end
+
+
+ # Process the sources in @configuration.url as defined by profiles
+ # and command-line options. The differences are stored in @difftext (a Hash).
+ # show_output:: If true, show the output with the defined viewer.
+ def execute_downdiff(show_output=true, rebuild=false, &accumulator)
+ if @configuration.todo.empty?
+ $logger.error 'Nothing to do'
+ return 5
+ end
+ @configuration.todo.each do |url|
+ opts = @configuration.urls[url]
+ $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
+
+ diffed = @configuration.diffname(url, true)
+ $logger.debug "diffname: #{diffed}"
+
+ if File.exists?(diffed)
+ $logger.warn "Reuse old diff: #{@configuration.url_get(url, :title, url)} => #{diffed}"
+ difftext = File.read(diffed)
+ accumulate(url, difftext, opts)
+ else
+ latest = @configuration.latestname(url, true)
+ $logger.debug "latest: #{latest}"
+ next unless rebuild or !skip_url?(url, latest, opts)
+
+ older = @configuration.oldname(url, true)
+ $logger.debug "older: #{older}"
+
+ begin
+ if rebuild or download(url, opts, latest, older)
+ difftext = diff(url, opts, latest, older)
+ if difftext
+ @configuration.write_file(diffed, 'wb') {|io| io.puts difftext}
+ # $logger.debug "difftext: #{difftext}" #DBG#
+ if accumulator
+ accumulator.call(url, difftext, opts)
+ else
+ accumulate(url, difftext, opts)
+ end
+ end
+ end
+ rescue Exception => e
+ $logger.error e.to_s
+ $logger.info e.backtrace.join("\n")
+ end
+ end
+ end
+ return show_output ? show : @difftext.empty? ? 0 : 1
+ end
+
+
+ def move(from, to)
+ # copy_move(:rename, from, to) # ftools
+ copy_move(:mv, from, to) # FileUtils
+ end
+
+
+ def copy(from, to)
+ # copy_move(:copy, from, to)
+ copy_move(:cp, from, to)
+ end
+
+
+ def copy_move(method, from, to)
+ if File.exists?(from)
+ $logger.debug "Overwrite: #{from} -> #{to}" if File.exists?(to)
+ lst = File.lstat(from)
+ FileUtils.send(method, from, to)
+ File.utime(lst.atime, lst.mtime, to)
+ @configuration.mtimes.set(from, lst.mtime)
+ @configuration.mtimes.set(to, lst.mtime)
+ end
+ end
+
+
+ def format_tdiff(secs)
+ d = (secs / DAY_SECS).to_i
+ if d > 0
+ return "#{d}d"
+ else
+ d = (secs / HOUR_SECS).to_i
+ return "#{d}h"
+ end
+ end
+
+
+ def ensure_dir(dir, fatal_nondir=true)
+ if File.exist?(dir)
+ unless File.directory?(dir)
+ if fatal_nondir
+ $logger.fatal "Not a directory: #{dir}"
+ exit 5
+ else
+ $logger.info "Not a directory: #{dir}"
+ return false
+ end
+ end
+ else
+ parent = Pathname.new(dir).parent.to_s
+ ensure_dir(parent, fatal_nondir) unless File.directory?(parent)
+ Dir.mkdir(dir)
+ end
+ return true
+ end
+
+
+ private
+
+ def download(url, opts, latest, older=nil)
+ if @configuration.done.include?(url)
+ $logger.info "Already downloaded: #{@configuration.url_get(url, :title, url).inspect}"
+ return false
+ end
+
+ $logger.warn "Download: #{@configuration.url_get(url, :title, url).inspect}"
+ @configuration.done << url
+ text = @configuration.call_cmd(@configuration.url_get(url, :download), [url], :url => url)
+ # $logger.debug text #DBG#
+ unless text
+ $logger.warn "no contents: #{@configuration.url_get(url, :title, url)}"
+ return false
+ end
+
+ if opts
+ if (sleepsecs = opts[:sleep])
+ sleep sleepsecs
+ end
+ text = text.split("\n")
+ if (range = opts[:lines])
+ $logger.debug "download: lines=#{range}"
+ text = text[range] || []
+ end
+ if (range = opts[:cols])
+ $logger.debug "download: cols=#{range}"
+ text.map! {|l| l[range]}
+ text.compact!
+ end
+ if (o = opts[:sort])
+ $logger.debug "download: sort=#{o}"
+ case o
+ when true
+ text.sort!
+ when Proc
+ text.sort!(&o)
+ end
+ end
+ if (o = opts[:strip])
+ $logger.debug "download: strip!"
+ text.delete_if {|l| l !~ /\S/}
+ end
+ text = text.join("\n")
+ end
+
+ pprc = @configuration.url_get(url, :downloadprocess)
+ if pprc
+ $logger.debug "download process: #{pprc}"
+ text = @configuration.call_cmd(pprc, [text], :url => url)
+ # $logger.debug text #DBG#
+ end
+
+ if text and !text.empty?
+ if older
+ if File.exist?(latest)
+ move(latest, older)
+ elsif !File.exist?(older)
+ $logger.warn "Initial copy: #{latest.inspect}"
+ end
+ end
+ @configuration.write_file(latest) {|io| io.puts(text)}
+ return true
+ else
+ return false
+ end
+ end
+
+
+ def diff(url, opts, new, old)
+ if File.exists?(old)
+ $logger.debug "diff: #{old} <-> #{new}"
+ difftext = @configuration.call_cmd(@configuration.url_get(url, :diff), [old, new], :url => url)
+ # $logger.debug "diff: #{difftext}" #DBG#
+
+ if difftext =~ /\S/
+ if (pprc = @configuration.url_get(url, :diffprocess))
+ $logger.debug "diff process: #{pprc}"
+ difftext = @configuration.call_cmd(pprc, [difftext], :url => url)
+ end
+ # $logger.debug "difftext: #{difftext}" #DBG#
+ if difftext =~ /\S/
+ $logger.warn "Changed: #{@configuration.url_get(url, :title, url).inspect}"
+ return difftext
+ end
+ end
+
+ $logger.debug "Unchanged: #{@configuration.url_get(url, :title, url).inspect}"
+
+ elsif File.exist?(new) and
+ (@configuration.url_get(url, :show_initial) or @configuration.optval_get(:global, :show_initial))
+
+ return File.read(new)
+
+ end
+ return nil
+ end
+
+
+ def skip_url?(url, latest, opts)
+ if File.exists?(latest) and !opts[:ignore_age]
+ tn = Time.now
+ tl = @configuration.mtimes.mtime(latest)
+ td = tn - tl
+ tdiff = tdiff_with(opts, tn, tl)
+ case tdiff
+ when nil, false
+ $logger.debug "Age requirement fulfilled: #{@configuration.url_get(url, :title, url).inspect}: #{format_tdiff(td)} old"
+ return false
+ when :skip, true
+ $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old"
+ return true
+ when Numeric
+ if td < tdiff
+ tdd = tdiff - td
+ @tdiff_min = tdd if @tdiff_min.nil? or tdd < @tdiff_min
+ $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old (#{format_tdiff(tdiff)})"
+ return true
+ end
+ else
+ $logger.fatal "Internal error: tdiff=#{tdiff.inspect}"
+ exit 5
+ end
+ end
+ end
+
+
+ def tdiff_with(opts, tn, tl)
+ if (hdiff = opts[:hours])
+ tdiff = hdiff * HOUR_SECS
+ $logger.debug "hours: #{hdiff} (#{tdiff}s)"
+ elsif (daily = opts[:daily])
+ tdiff = tl.year == tn.year && tl.yday == tn.yday
+ $logger.debug "daily: #{tl} <=> #{tn} (#{tdiff})"
+ elsif (dweek = opts[:days_of_week] || opts[:wdays])
+ tdiff = tdiff_x_of_y(dweek, tn.wday, tn.yday / 7, tl.yday / 7)
+ $logger.debug "wdays: #{dweek} (#{tdiff})"
+ elsif (dmonth = opts[:days_of_month] || opts[:mdays])
+ tdiff = tdiff_x_of_y(dmonth, tn.day, tn.month, tl.month)
+ $logger.debug "mdays: #{dmonth} (#{tdiff})"
+ elsif (ddiff = opts[:days])
+ tdiff = ddiff * DAY_SECS
+ $logger.debug "days: #{ddiff} (#{tdiff}s)"
+ elsif (dmonth = opts[:months])
+ tnowm = tn.month + 12 * (tn.year - tl.year)
+ tlm = tl.month
+ tdiff = (tnowm - tlm) < dmonth
+ $logger.debug "months: #{dmonth} (#{tdiff})"
+ else
+ tdiff = false
+ end
+ return tdiff
+ end
+
+
+ def tdiff_x_of_y(eligible, now, parent_eligible, parent_now)
+ if parent_eligible == parent_now
+ return true
+ else
+ case eligible
+ when Array, Range
+ return !eligible.include?(now)
+ when Integer
+ return eligible != now
+ else
+                $logger.error "Wrong type for :days_of_week=#{eligible.inspect}"
+ return :skip
+ end
+ end
+ end
+
+
+ def accumulate(url, difftext, opts=nil)
+ # opts ||= @configuration.urls[url]
+ @difftext[url] = difftext
+ end
+
+
+ def show
+ begin
+ return @configuration.show_output(@difftext)
+ ensure
+ clean_diffs
+ end
+ end
+
+
+ def clean_diffs
+ Dir[File.join(@configuration.cfgdir, 'diff', '*')].each do |f|
+ $logger.debug "Delete saved diff: #{f}"
+ File.delete(f)
+ end
+ end
+
+end
+
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
39 lib/websitary/applog.rb
@@ -0,0 +1,39 @@
+# applog.rb
+# @Last Change: 2007-09-11.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+require 'logger'
+
+
+# A simple wrapper around Logger.
+class Websitary::AppLog
+ def initialize(output=nil)
+ @output = output || $stdout
+ $logger = Logger.new(@output, 'daily')
+ $logger.progname = Websitary::APPNAME
+ $logger.datetime_format = "%H:%M:%S"
+ set_level
+ end
+
+
+ def set_level(level=:default)
+ case level
+ when :debug
+ $logger.level = Logger::DEBUG
+ when :verbose
+ $logger.level = Logger::INFO
+ when :quiet
+ $logger.level = Logger::ERROR
+ else
+ $logger.level = Logger::WARN
+ end
+ $logger.debug "Set logger level: #{level}"
+ end
+end
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
1,903 lib/websitary/configuration.rb
1,903 additions, 0 deletions not shown because the diff is too large. Please use a local Git client to view these changes.
58 lib/websitary/filemtimes.rb
@@ -0,0 +1,58 @@
+# filemtimes.rb
+# @Last Change: 2007-09-16.
+# Author:: Thomas Link (micathom AT gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-09-08.
+
+
+# require 'ftools'
+require 'yaml'
+
+
+class Websitary::FileMTimes
+ def initialize(configuration)
+ @configuration = configuration
+ @store = File.join(@configuration.cfgdir, 'mtime.yml')
+ @data = {}
+ swap_in
+ end
+
+ def swap_in
+ if File.exist?(@store)
+ @data = YAML.load_file(@store)
+ case @data
+ when Hash
+ else
+ $logger.error 'mtime.yml stored malformed data'
+ @data = {}
+ end
+ File.delete(@store)
+ end
+ end
+
+ def swap_out
+ File.open(@store, 'w') {|f| YAML.dump(@data, f)}
+ end
+
+ def mtime(filename)
+ filenamec = @configuration.canonic_filename(filename)
+ @data[filenamec] ||= set(filename)
+ end
+
+ def set(filename, mtime=nil)
+ if File.exist?(filename)
+ mtime ||= File.mtime(filename)
+ filenamec = @configuration.canonic_filename(filename)
+ @data[filenamec] = mtime
+            $logger.debug "Set mtime: #{filenamec} -> #{mtime.to_s}"
+ mtime
+ else
+ nil
+ end
+ end
+end
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
160 lib/websitary/htmldiff.rb
@@ -0,0 +1,160 @@
+#!/usr/bin/env ruby
+# htmldiff.rb
+# @Last Change: 2007-11-10.
+# Author:: Thomas Link (micathom at gmail com)
+# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
+# Created:: 2007-08-17.
+#
+# == Basic Use
+# htmldiff OLD NEW [HIGHLIGHT-COLOR] > DIFF
+
+require 'hpricot'
+
+
+module Websitary
+ # A simple class to generate diffs for html files using hpricot.
+    # It is quite likely to miss certain details and to yield wrong
+    # results (especially false negatives) on certain occasions.
+ class Htmldiff
+ VERSION = '0.1'
+ REVISION = '180'
+
+ # args:: A hash
+ # Fields:
+ # :oldtext:: The old version
+ # :newtext:: The new version
+ # :highlight:: Don't strip old content but highlight new one with this color
+ # :args:: Command-line arguments
+ def initialize(args)
+ @args = args
+ @high = args[:highlight] || args[:highlightcolor]
+ @old = explode(args[:olddoc] || Hpricot(args[:oldtext] || File.read(args[:oldfile])))
+ @new = args[:newdoc] || Hpricot(args[:newtext] || File.read(args[:newfile]))
+ @ignore = args[:ignore]
+ if @ignore and !@ignore.kind_of?(Enumerable)
+ die "Ignore must be of kind Enumerable: #{ignore.inspect}"
+ end
+ @changed = false
+ end
+
+
+ # Do the diff. Return an empty string if nothing has changed.
+ def diff
+ rv = process.to_s
+ @changed ? rv : ''
+ end
+
+
+ # It goes like this: if a node isn't in the list of old nodes either
+ # the node or its content has changed. If the content is a single
+ # node, the whole node has changed. If only some sub-nodes have
+ # changed, collect those.
+ def process(node=@new)
+ acc = []
+ node.each_child do |child|
+ ch = child.to_html.strip
+ next if ch.nil? or ch.empty?
+ if @old.include?(ch) or ignore(child, ch)
+ if @high
+ acc << child
+ end
+ else
+ if child.respond_to?(:each_child)
+ acc << process(child)
+ else
+ acc << highlight(child).to_s
+ acc << '<br />' unless @high
+ end
+ end
+ end
+ replace_inner(node, acc.join("\n"))
+ end
+
+
+ def ignore(node, node_as_string)
+ return @ignore && @ignore.any? do |i|
+ case i
+ when Regexp
+ node_as_string =~ i
+ when Proc
+                    i.call(node)
+ else
+ die "Unknown type for ignore expression: #{i.inspect}"
+ end
+ end
+ end
+
+
+ # Collect all nodes and subnodes in a hpricot document.
+ def explode(node)
+ if node.respond_to?(:each_child)
+ acc = [node.to_html.strip]
+ node.each_child do |child|
+ acc += explode(child)
+ end
+ acc
+ else
+ [node.to_html.strip]
+ end
+ end
+
+
+ def highlight(child)
+ @changed = true
+ if @high
+ if child.respond_to?(:each_child)
+ acc = []
+ child.each_child do |ch|
+ acc << replace_inner(ch, highlight(ch).to_s)
+ end
+ replace_inner(child, acc.join("\n"))
+ else
+ case @args[:highlight]
+ when String
+ opts = %{class="#{@args[:highlight]}"}
+ when true, Numeric
+ opts = %{class="highlight"}
+ else
+ opts = %{style="background-color: #{@args[:highlightcolor]};"}
+ end
+ ihtml = %{<span #{opts}>#{child.to_s}</span>}
+ replace_inner(child, ihtml)
+ end
+ else
+ child
+ end
+ end
+
+
+ def replace_inner(child, ihtml)
+ case child
+ when Hpricot::Comment
+ child
+ when Hpricot::Text
+ Hpricot(ihtml)
+ else
+ child.inner_html = ihtml
+ child
+ end
+ end
+
+ end
+end
+
+
+if __FILE__ == $0
+ old, new, aargs = ARGV
+ if old and new
+ args = {:args => aargs, :oldfile => old, :newfile => new}
+ args[:highlightcolor], _ = aargs
+ acc = Websitary::Htmldiff.new(args).diff
+ puts acc
+ else
+ puts "#{File.basename($0)} OLD NEW [HIGHLIGHT-COLOR] > DIFF"
+ end
+end
+
+
+# Local Variables:
+# revisionRx: REVISION\s\+=\s\+\'
+# End:
