webcorpus pipeline

This project is a collection of scripts and programs for creating a webcorpus
from crawled data.
The input data is extracted by the WIRE crawler
(http://www.cwr.cl/projects/WIRE/) and the output is a text file with document
separators and raw text.
Sample output data and the published article can be found on the homepage
of our research group at http://hlt.sztaki.hu/resources/webcorpora.html

Dependencies:
- gcc
- flex
- libtextcat
  - http://software.wise-guys.nl/libtextcat/
  - WARNING: libtextcat unfortunately uses a predefined confidence parameter,
    which cannot be changed at runtime. We created a small patch that fixes
    this, so before compiling and installing libtextcat, please apply our patch
    at src/libtextcat-2.2.patch with
    - patch -p1 < /path/to/libtextcat-2.2.patch # from the directory
      that contains the original libtextcat-2.2 directory; or
    - patch -p2 < /path/to/libtextcat-2.2.patch # from inside the
      libtextcat-2.2 directory itself
- libhunspell
  - we tested only with libhunspell-1.3

Usage:
- src/Makefile builds everything and moves the binaries to the bin/ folder:
  run make to build and make clean to clean up
- dat/Makefile processes the data; see the Makefile for details
  example:
  make finnish.raw; make finnish.parsed; make finnish.senfiltered; etc.
- some scenarios:
  - the crawl ended without problems and the data was extracted with
    wire-info-extract
    - the result is in mylang.wire
    - run these commands in dat/:
      - make mylang.raw
      - make mylang.parsed
      - make mylang.senfiltered
      - make mylang.langfiltered
      - make mylang.dedup
      - make mylang.neardedup
      - make mylang.tok
      - make mylang.freq
      - make mylang.stemdict
    - OR simply (because make resolves the dependencies itself):
      - make mylang.tokenized
  - there were some problems during crawling
    - rename Data/text/storage.raw in the crawling directory to
      dat/mylang.not_extractable_wire and run these commands:
      - make mylang.raw
      - make mylang.parsed
      - make mylang.senfiltered
      - make mylang.langfiltered
      - make mylang.cleaned # see WIRE troubles later in this file
      - make mylang.cleaned_langfiltered # see WIRE troubles later in this file
      - make mylang.dedup
      - make mylang.neardedup
      - make mylang.tok
      - make mylang.freq
      - make mylang.stemdict
    - this cannot be shortened to "make mylang.tokenized" because
      then the optional cleaning steps won't run; whether to run them depends
      on user needs and the data

- when processing a new language, one has to edit dat/Makefile and add the
  hunspell dictionaries in the same format as the existing entries,
  or, if there is no hunspell dictionary for the language (as with Finnish),
  use libtextcat (dat/Makefile does that automatically
  when no dictionary is given) and set the TEXTCAT_CONFIG and
  TEXTCAT_CONF_LIMIT variables in dat/Makefile.
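
The stage order listed in the first scenario above can also be scripted. The
following is only a minimal sketch and is not part of this repository:
run_stages and STAGES are hypothetical names, and the script assumes it is
started from dat/. As written it only prints the make commands (dry run).

```shell
#!/bin/sh
# Stage order copied from the first scenario in this README.
# For a crawl that had problems, add "cleaned cleaned_langfiltered"
# after "langfiltered", as in the second scenario.
STAGES="raw parsed senfiltered langfiltered dedup neardedup tok freq stemdict"

# run_stages LANG [COMMAND...]
# Hypothetical helper: runs COMMAND (default: make) once per stage with
# LANG.STAGE as its last argument and stops at the first failing stage.
run_stages() {
    lang=$1
    shift
    [ "$#" -gt 0 ] || set -- make
    for stage in $STAGES; do
        "$@" "$lang.$stage" || return 1
    done
}

# Dry run: print the commands instead of executing them.
run_stages mylang echo make
```

To actually build the targets, call "run_stages mylang" without the extra
arguments. Stopping at the first failure keeps later stages from running on
incomplete input.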

WIRE troubles
- when maxdoc is set too high in the XML config, crawling slows down
  considerably after a few rounds. See the crawling tutorial on the GitHub
  wiki page for details
- There are some encoding issues in WIRE (especially when the run didn't end
  correctly) such as:
  - HTML is not in UTF-8 even though it is supposed to be (set in wire.conf)
  - some HTML pages declare the wrong encoding, so some characters
    get garbled, and this can no longer be fixed with a simple iconv.
    The main problem is when a multibyte UTF-8 character is treated as several
    characters in a 1-byte encoding, which corrupts the text
- when WIRE crashes while running the gatherer or seeder, crawling cannot be
  continued and the data can only be extracted the hard way:
  wire-data/text/storage.raw has to be renamed to mylang.not_extractable_wire
  and processed afterwards (see the second scenario above)
- because of these problems, there is a "clean encoding" phase in dat/Makefile
  after language filtering. It is placed there because our cleaning
  procedure needs clean data for its character statistics, and that data is
  extracted from text already judged by hunspell or textcat to be in the
  given language. After cleaning, a second language filtering pass runs,
  because it now yields more good data in the given language.
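
The multibyte problem described above can be reproduced on a single word.
This is only an illustration of the effect, not the cleaning procedure used
in dat/Makefile; the sample word and the variable names are made up, and
ISO-8859-1 stands in for "a 1-byte encoding".

```shell
#!/bin/sh
# "é" is the two bytes C3 A9 in UTF-8 (written as octal escapes here).
utf8_text=$(printf 'caf\303\251')

# Treat those bytes as ISO-8859-1 and convert them "to" UTF-8:
# each of the two bytes of "é" becomes a separate character.
garbled=$(printf '%s' "$utf8_text" | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$garbled"            # shows as cafÃ© on a UTF-8 terminal
printf '%s' "$utf8_text" | wc -c    # 5 bytes before
printf '%s' "$garbled" | wc -c      # 7 bytes after
```

The original two-byte character has turned into two characters of two bytes
each, which is the kind of damage the "clean encoding" phase is meant to
handle.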