docdiff Readme

2000-12-09..2006-02-03 Hisashi MORITA

News

0.3.3 (2006-02-03)

Fixed the argument test so that non-normal files, such as device files and named pipes, can be compared (thanks to Shugo Maeda). Added DocDiff Web UI sample (experimental). Fixed HTML output to produce valid XHTML (thanks to Hiroshi OHKUBO). Note that the CSS in HTML output has changed slightly. Replaced underscores (_) in CSS class names with hyphens (-) so that older UAs can understand them (thanks to Kazuhiro NISHIYAMA).

0.3.2 (2005-01-03)

Readme is multilingualized (added partial Japanese translation). Try switching CSS between en and ja. Monolingual files are also available (readme.en.html, readme.ja.html). Outputs better error messages when it fails to auto-detect the encoding and/or eol, though the detection accuracy is unchanged. Switched revision control system from CVS to Subversion.

0.3.1 (2004-08-29)

Added a placeholder -L (--label) option so that DocDiff can be used as an external diff program from Subversion.

0.3.0 (2004-05-29)

  • Re-designed and re-written from scratch.

  • Supports multiple encodings (ASCII, EUC-JP, Shift_JIS, UTF-8) and multiple eols (CR, LF, CRLF).

  • Supports more output formats (tty, HTML, Manued, wdiff-like, user-defined markup text).

  • Supports configuration files (/etc/docdiff/docdiff.conf, ~/etc/docdiff/docdiff.conf (or ~/.docdiff/docdiff.conf)).

  • Introduced digest (summary) mode.

  • Approximately 200% faster than older versions, thanks to akr’s diff library.

  • Better documentation and help message.

  • License changed from Ruby’s to modified BSD style.

  • Pure Ruby. Does not require external diff program such as GNU diff, or morphological analyzer such as ChaSen.

  • Runs on both Unix and Windows (tested on Debian GNU/Linux and Cygwin).

  • Unit tests introduced to decrease bugs and to encourage faster development.

  • Makefile introduced.

0.1.8 (2003-12-14)

Displays warning when --bymorpheme is specified but ChaSen is not available (patch by Akira YAMADA: Debian bug #192258). Supports system-wide configuration file (if ~/.chasenrc.docdiff does not exist, reads /etc/docdiff/chasenrc) (patch by Akira YAMADA: Debian bug #192261).

0.1.7 (2003-11-21)

HTML output retains spaces (patch by Akira YAMADA). Manued output is added. Use the --manued command line option to get the result in Manued-like format.

Fixed .chasenrc.docdiff to be compatible with the latest ChaSen, so that it does not cause an error.

Alphabetic words in the output may look ugly, since recent versions of ChaSen do not keep the spaces between them.

Other minor bug fixes and code cleanup.

0.2.0b2 (2001-08-31)

Code cleanup.

0.2.0b1 (2001-08-31)

A bit faster than 0.1.x, using file cache. A bit cleaner code.

0.1.6 (2001-05-16)

Increased diff option number from 100000 to 1000000 in order to support 900KB+ text files.

0.1.5 (2001-01-17)

Removed useless old code that was already commented out. Added documentation (updated README, more comments). First public release. Registered to RAA.

0.1.4 (2001-01-16)

Output is like <tag>ab</tag>, instead of ugly <tag>a</tag><tag>b</tag> (thanks again to Masatoshi Seki for suggestion). Fixed hidden bug (‘puts’ was used to output result). Some code clean-up, though still hairy enough.

0.1.3 (2001-01-09)

Tested with Ruby 1.6.2. Fixed “meth(a,b,)” bug (thanks to Masatoshi Seki). Switched development platform from Windows to Linux, but it should work fine on Windows too, except for ChaSen stuff.

0.1.2 (2000-12-28)

Mostly bug fix.

0.1.1 (2000-12-25)

Bug fix and some cleanup. Quotes some of HTML special characters (<>&) when output in HTML. Added support for tty output using escape sequence.

0.1.0 (2000-12-19)

ChaSen works fine now. GetoptLong was introduced to support command line options.

0.1.0a1 (2000-12-16)

Added ChaSen support. Japanese word by word comparison requires ChaSen. Converted scripts from Shift_JIS/CRLF to EUC-JP/LF.

0.0.2 (2000-12-10)

Rewritten to use class.

0.0.1 (2000-12-09)

First version. Proof-of-concept. Supports ASCII, EUC-JP, LF only. Supports HTML output only. Requires GNU diff. Distributed under the same license as Ruby’s.

See the ChangeLog for detail.

Todo

  • Fix bugs.

  • Incorporate user feedback.

  • Better auto-recognition of encodings and eols.

  • Make CSS and tty escape sequence customizable in config files.

  • Make stand-alone Windows application using Exerb, hopefully with fancy GUI.

  • Multilingualization (M17N), when Ruby is multilingualized.

  • Write “DocPatch”.

Description

Compares two text files word by word / char by char.

Summary

DocDiff compares two text files and shows the difference. It can compare files word by word, char by char, or line by line. It has several output formats such as HTML, tty, Manued, or user-defined markup.

It supports several encodings and end-of-line characters, including ASCII, UTF-8, EUC-JP, Shift_JIS, CR, LF, and CRLF.

Requirement

  • Ruby

(Note that you may need additional Ruby libraries such as iconv and uconv, if your OS’s Ruby package does not include them.)

Installation

Note that you need appropriate permissions for a proper installation (you may need root/administrator privileges).

Place the docdiff/ directory and its contents in the Ruby library directory, so that the Ruby interpreter can load them.

(e.g.
# cp -r docdiff /usr/lib/ruby/1.8
)

Place docdiff.rb in command binary directory.

(e.g.
# cp docdiff.rb /usr/bin/
)

Optionally you may want to rename it to docdiff.

(e.g.
# mv /usr/bin/docdiff.rb /usr/bin/docdiff
)

Set appropriate permission (use /usr/bin/docdiff.rb instead if you did not rename the file).

(e.g.
# chmod +x /usr/bin/docdiff
)

(Optional) If you want a site-wide configuration file, place docdiff.conf.example as /etc/docdiff/docdiff.conf and edit it.

(e.g.
# mkdir -p /etc/docdiff
# cp docdiff.conf.example /etc/docdiff/docdiff.conf
# $EDITOR /etc/docdiff/docdiff.conf
)

(Optional) If you want a per-user configuration file, place docdiff.conf.example as ~/etc/docdiff/docdiff.conf and edit it.

(e.g.
% mkdir -p ~/etc/docdiff
% cp docdiff.conf.example ~/etc/docdiff/docdiff.conf
% $EDITOR ~/etc/docdiff/docdiff.conf
)

Usage

Synopsis

docdiff [options] oldfile newfile

e.g. % docdiff old.txt new.txt > diff.html

See the help message for detail (docdiff --help).

Example

% cat sample/01.en.ascii.lf
Hello, my name is Watanabe.
I am just another Ruby porter.
% cat sample/02.en.ascii.lf
Hello, my name is matz.
It's me who has created Ruby. I am a Ruby hacker.
% docdiff sample/01.en.ascii.lf sample/02.en.ascii.lf
Hello, my name is Watanabe.matz.
It's me who has created Ruby.  I am just another a Ruby porter.hacker.
%

Configuration

You can place configuration files at:

  • /etc/docdiff/docdiff.conf (site-wide configuration)

  • ~/etc/docdiff/docdiff.conf (user configuration)

    • (~/etc/docdiff/docdiff.conf is used by default in order to keep home directory clean, preventing dotfiles and dotdirs from scattering around. Alternatively, you can use ~/.docdiff/docdiff.conf as user configuration file name, following the traditional Unix convention.)

Notation is as follows (also refer to the file docdiff.conf.example included in the distribution archive):

# comment
key1 = value
key2 = value
...

Every value is treated as a string, unless it looks like a number. In that case, the value is treated as a number (usually an integer).
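
As a rough illustration of the notation and the typing rule, the following Ruby sketch parses such a file into a Hash. The method name parse_config, the number patterns, and the treatment of "#" anywhere on a line as the start of a comment are assumptions made for this example; DocDiff's own parser may behave differently.

```ruby
# Hypothetical helper: parse "key = value" lines into a Hash.
# This illustrates the notation only; it is not DocDiff's own parser.
def parse_config(text)
  config = {}
  text.each_line do |line|
    line = line.sub(/#.*/, "").strip   # drop comments and surrounding spaces
    next if line.empty?
    key, value = line.split("=", 2).map(&:strip)
    next if value.nil?                 # skip lines without "="
    config[key] =
      case value
      when /\A-?\d+\z/      then value.to_i   # looks like an integer
      when /\A-?\d+\.\d+\z/ then value.to_f   # looks like a decimal number
      else value                              # everything else stays a string
      end
  end
  config
end

sample = <<~CONF
  # comment
  key1 = 80
  key2 = html
CONF

p parse_config(sample)
```

Here key1 ends up as the Integer 80 and key2 as the String "html".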

Troubleshooting and Tips

wrong argument type nil (expected Module) (TypeError)

Sometimes DocDiff fails to auto-recognize encoding and/or end-of-line character. You may get an error like this.

charstring.rb:47:in `extend': wrong argument type nil (expected Module) (TypeError)

In such a case, try explicitly specifying encoding and end-of-line character (e.g. docdiff --utf8 --crlf).
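
To see why auto-recognition can fail, consider a simple counting heuristic for the end-of-line character. This is only a sketch under the assumption that detection works by counting candidates; the method name guess_eol is hypothetical and this is not DocDiff's actual detection code.

```ruby
# Hypothetical heuristic: guess the end-of-line convention by counting.
# Returns "CRLF", "CR", "LF", or nil when the text is ambiguous or has no eol.
def guess_eol(text)
  bytes = text.b                           # binary copy, safe to scan in any supported encoding
  crlf = bytes.scan("\r\n").size
  counts = {
    "CRLF" => crlf,
    "CR"   => bytes.count("\r") - crlf,    # bare CRs only
    "LF"   => bytes.count("\n") - crlf     # bare LFs only
  }
  best, n = counts.max_by { |_, c| c }
  return nil if n.zero?                    # no eol at all
  return nil if counts.values.count(n) > 1 # tie: cannot decide
  best
end

p guess_eol("one\r\ntwo\r\n")
p guess_eol("one line, no eol")
```

A text with no end-of-line character, or with mixed conventions in equal numbers, gives such a heuristic nothing to decide on, which is when specifying the options explicitly helps.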

Inappropriate Insertion / Deletion

When comparing space-separated texts (such as English or program source code), the word next to the end of line is sometimes unnecessarily deleted and inserted. This is due to the limitation of DocDiff’s word splitter. It splits strings into words like the following.

text 1:

foo bar
("foo bar" => ["foo ", "bar"])

text 2:

foo
bar
("foo\nbar" => ["foo", "\n", "bar"])

comparison result:

foo foo
bar
("<del>foo </del><ins>foo</ins><ins>\n</ins>bar")

Foo is (unnecessarily) deleted and inserted at the same time.

I would like to fix this sometime, but it’s not easy. If you split off each single space as a separate element (i.e. ["foo", " ", "bar"]), the word order of the comparison result becomes less natural. Suggestions are welcome.
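
The effect can be reproduced with a few lines of Ruby. The splitter below is a hypothetical stand-in (the method name split_words and its regular expression are assumptions, not DocDiff's actual splitter) that yields the same token boundaries as the example above.

```ruby
# Hypothetical splitter that reproduces the token boundaries shown above:
# a word keeps one trailing space, but a newline becomes its own token.
def split_words(text)
  text.scan(/\S+ ?|\s/)
end

a = split_words("foo bar")    # ["foo ", "bar"]
b = split_words("foo\nbar")   # ["foo", "\n", "bar"]

# Only tokens that are identical in both texts can be reported as unchanged.
unchanged = a & b             # ["bar"]

p a, b, unchanged
```

Because "foo " (with a trailing space) and "foo" are different tokens, a token-level diff has to report one as deleted and the other as inserted, even though the visible word is the same.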

Using DocDiff with Version Control Systems

If you want to use DocDiff as an external diff program from Subversion, the following may work.

% svn diff --diff-cmd=docdiff --extensions "--ascii --lf --tty --digest"

With zsh's =(command) substitution, you can use DocDiff or another utility to compare arbitrary sources, for example a specific revision of foo.html in a repository with the copy on a website.

Comparing Non-plain Text Files Such As HTML or Microsoft Word Documents

You can compare files other than plain text, such as HTML and Microsoft Word documents, if you use an appropriate converter.

Comparing the content of two HTML documents (without tags):
% docdiff =(w3m -dump -cols 10000 foo.html) =(w3m -dump -cols 10000 http://www.example.org/foo.html)

Comparing the content of two Microsoft Word documents:
% docdiff =(wvWare foo.doc | w3m -T text/html -dump -cols 10000) =(wvWare bar.doc | w3m -T text/html -dump -cols 10000)

License

This software is distributed under the so-called modified BSD-style license (www.opensource.org/licenses/bsd-license.php, without the advertisement clause). By contributing to this software, you agree that your contribution may be incorporated under the same license.

Copyright and condition of use of main portion of the source:

Copyright (C) Hisashi MORITA.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. Neither the name of the University nor the names of its contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.

diff library (docdiff/diff.rb and docdiff/diff/*) was originally a part of Ruby/CVS by Akira TANAKA. Ruby/CVS is licensed under modified BSD style license. See the following for detail.

Credits

Hisashi MORITA (primary author)

Acknowledgments

  • Akira TANAKA (diff library author)

  • Shin’ichiro HARA (initial idea and algorithm suggestion)

  • Masatoshi SEKI (patch)

  • Akira YAMADA (patch, Debian package)

  • Kenshi MUTO (testing, bug report, Debian package)

  • Kazuhiro NISHIYAMA (bug report)

  • Hiroshi OHKUBO (bug report)

  • Shugo MAEDA (bug report)

Resources

Format

  • HTML/XHTML

    http://www.w3.org
    
  • tty (Graphic rendition using VT100 / ANSI escape sequence)

    VT100: http://vt100.net/docs/tp83/appendixb.html
    ANSI: http://www.tldp.org/HOWTO/Bash-Prompt-HOWTO/x329.html

  • Manued (Manuscript Editing language: a proofreading method for text)

    http://www.archi.is.tohoku.ac.jp/~yamauchi/otherprojects/manued/index.shtml
    

Similar Software

Several other programs can compare text word by word and/or character by character.

  • GNU wdiff (Seems to support single byte characters only.)

    http://www.gnu.org/directory/GNU/wdiff.html
    
  • cdif by Kazumasa UTASHIRO (Supports several Japanese encodings.)

    http://srekcah.org/~utashiro/perl/scripts/cdif
    
  • ediff for Emacsen

    http://www.xemacs.org/Documentation/packages/html/ediff.html
    
  • diff-detail for xyzzy, by Hiroshi OHKUBO

    http://ohkubo.s53.xrea.com/xyzzy/index.html#diff-detail
    
  • Manuediff (Outputs difference in Manued format.)

    http://hibiki.miyagi-ct.ac.jp/~suzuki/comp/export/manuediff.html
    
  • YASDiff (Yet Another Scheme powered diff) by Y. Fujisawa

    http://nnri.dip.jp/~yf/cgi-bin/yaswiki2.cgi?name=YASDiff&parentid=0
