Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
A lightweight, low-memory character encoding detector suitable for massive files.
branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
src
test
.gitignore
COPYING
Makefile
README

README

detenc
A lightweight, low-memory character encoding detector.
Paul Battley <pbattley@gmail.com>
http://www.reevoo.com/

detenc is a fast character encoding detector for Western European text.  It can
determine whether a file is encoded in US-ASCII, UTF-8, ISO-8859-15,
WINDOWS-1252, or something else. It can distinguish ISO-8859-15 and
WINDOWS-1252 where there is enough information: this means that Euro signs are
handled correctly.

The program was written to help normalise the encoding of very large data feeds
(of the order of several gigabytes) at Reevoo. It uses very little memory and
can determine the encoding of a two-gigabyte file in under a minute.

Build

The program is written in C and uses standard libraries. The test suite
requires Ruby (1.8 or 1.9) and Rake to be available

  make         # builds binary
  make check   # runs test suite
  make install # installs binary to /usr/local/bin

Use

  detenc FILENAME (FILENAME ...)

This will output the filename and encoding, one per line. To print just the
encoding, use the -q switch.
Something went wrong with that request. Please try again.