Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

texrex web page cleaning system

Web document cleaning/crawl file processing tools written by Roland Schäfer for the COW project.

Current milestone: texrex-behindthecow (2015)

texrex is a free software for processing data files from crawls and turn them into a corpus of web documents. Currently, it is limited to reading ARC and WARC files, but other input modules can be developed quickly.

It performs the following processing steps:

  • read WARC or ARC files document by document
  • filter perfect duplicates using a Bloom filter
  • strip HTML, scripts, stylesheets
  • extract meta information from crawl headers
  • normalize encodings to UTF-8 (using ICU), optionally treating all ISO-8859-1 input as Win-1252
  • convert all HTML entities to appropriate codepoints (including rogue Win-1252)
  • detect, remove, and/or annotate boilerplate blocks using state-of-the-art classifier (
  • assess the text quality of the documents by looking at frequencies of short frequent word (
  • create w-shingling document fingerprints and filter near-duplicate documents
  • perform in-document deduplication (remove repeated paragraphs, insert a backreference to first copy)
  • perform additional normalization (e.g., reduce diverse Unicode dashes and hyphens to the basic codepoint)
  • write standard-compliant XML output
  • add server IP geolocation meta information (country, region, city – currently based on GeoLite)

Technologically, the main features of texrex are:

  • written in FreePascal (Object FPC mode)
  • licensed under permissive 2-clause BSD license
  • uses multi-threading for single-machine parallelization
  • uses simple INI files to configure processing jobs for the main tool
  • can be run in the background, using an included IPC client to control the process
  • depends only on two additional libraries: ICU and FANN

New tools included since texrex-neuedimensionen (June 2014):

  • HyDRA hard hyphenation remover
  • rofl tool to fix run-together sentences

Papers to read about texrex and related technology:

This repo was moved to GitHub from SourceForge r. 662 on May 1, 2016. For reference:


texrex web page cleaning & ClaraX random walk crawler




No releases published


No packages published