GitHub - sofarrar/BitextMiner: to mine Russ./Eng. sites for translations

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
README		README

Repository files navigation

CLMA Summer 2010
Project "Bitext Miner"
Alexander Shchetnikov

1. Tasks
2. Crawling websites
3. Extracting bitext
4. Directories

1. Tasks

The goal of the project is to crawl websites containing English text with Russian translations, 
to extract such bitext data and postprocess it.

2. Crawling websites

As initial seeds, four websites were chosen for crawling:

http://usefulenglish.ru
http://englishouse.ru
http://www.proz.com
http://homeenglish.ru

They appeared to be most informative and they seemed to have maps favourable for processing.
Two open source software were considered for crawling: Nutch and Wget. Wget appeared to be
most suitable - because of its simplicity and crawling output options.

The two Wget commands were used:
	wget -i file (to download URLs specified in the file)
	wget -r -l1 -np http://page1/page2/ (to download recursively all URLs, 1 level down from page2)
Minor modification of these commands were tested (increasing level of download, adding option -w 5,
which is 5 seconds pause between retrievals, - useful to crawl http://www.proz.com which has 
retrievals per time limitations).

Downloaded URLs were organized into folders corresponding each website. Some files were put directly
in the folder, some files, as a result of a recursive download, were put in a tree of subfolders.

Scripts for crawling are: crawl_englishouse_homeenglish.sh, crawl_proz_usefulenglish.sh - they require
a file with urls as an argument (crawl_englishouse_homeenglish.sh urls_englishouse.txt), 
the files with urls are provided.


3. Extracting bitext

Processing downloaded files for bitext data had some challenges:
- Encoding of pages differed - two sites had UTF-8, the other two sites had windows-1251.
- each site had its own structure, bitext was placed differently
	Examples:
		<td><p>bitext_english</p></td> (englishouse.ru)
		<td><p>bitext_russian</p></td> (englishouse.ru)
		<p>bitext_english</p>(usefulenglish.ru)
		<p>bitext_russian</p>(usefulenglish.ru)
		<td>bitext_english</td>(homeenglish.ru)
		<td>bitext_russian</td>(homeenglish.ru)
		....<tr class=...>bitext_english<....<tr class=...>bitext_russian<....(www.prox.com)
The structure differed also within each site (for example, one page has a 2-column table, 
another page of the same site has a 3-column table). 
- downloaded files were stored in trees with different levels.

To extract as much data and, at the same time, keep the code as simple as possible, I wrote four 
different code - one for each website. If needed, these 4 codes could be combined.

Codes are written in Java, Filter.java, Filter1.java, Filter2.java, Filter3.java. 
All of them accept three files as input - path to the folder, name of the file for english output
and name of the file for russian output. Some codes create working files (.txt) for 
intermediate processing.

Processing of the four sites yielded between 30 and 32 thousand bilingual pairs. "Bad" data 
(swapped languages, incorrect translation, no content, etc.) is estimated to be about 1% of it, 
and it is produced mostly after processing www.proz.com.

4. Directories

On patas directories are arranged in the following order:

The root directory is /home2/shchea/project. It contains scripts for crawling, code projects folders
(Filter, Filter1, Filter2, Filter3), urls files (input for crawling), folders with downloaded raw data 
(englishouse.ru, homeenglish.ru, usefulenglish.ru, www.proz.com), output_data folder (with bitext data
for each site in four subfolders - output_data/englishouse.ru, output_data/homeenglish.ru, etc.; 
I named each file with English data as bitext_en.txt and each file with Russian data as bitext_ru.txt);
and this README file.

On github the folders are:
- BitextMiner is the root, containing README, files with seed urls and folders input/, output/ and src/
- input/ has 4 subfolders with crawled data from the four websites
- output/ has also 4 subfolders with processed bitext data
- src/ contains scripts and subfolders with codes (Filter, Filter1, Filter2, Filter3)