wiki-tokenizer


Code to download and tokenize Wikipedia data.

Install

You can install wikitokenizer directly from PyPI:

pip install wikitokenizer

Or from source:

git clone https://github.com/tpimentelms/wikitokenizer.git
cd wikitokenizer
pip install --editable .
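
Either way, you can verify the installation by inspecting the package metadata (pip show is a standard pip subcommand; it should report wikitokenizer if the install succeeded):

pip show wikitokenizer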

Dependencies

Wiki tokenizer has the following main requirements:

Usage

To download and tokenize Wikipedia data for a specific language in Wiki40B:

$ tokenize_wiki_40b --language <wikipedia_language_code> --tgt-dir <tgt_dir> --break-text-mode <break_text_mode>

Where <wikipedia_language_code> is the Wikipedia language code for the desired language, <tgt_dir> is the directory where the data should be saved, and <break_text_mode> is one of 'document', 'paragraph', or 'sentence'. This script produces train.txt, validation.txt, and test.txt files. To tokenize Finnish data, for example, run:

$ tokenize_wiki_40b --language fi --tgt-dir output/fi/ --break-text-mode document

To tokenize a previously downloaded file, run:

$ tokenize_wiki_file --language fi --src-fname <src_fname> --tgt-fname output/fi/wiki.txt
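
For example, assuming a previously downloaded Finnish file saved at output/fi/raw.txt (an illustrative path; substitute your actual file):

# output/fi/raw.txt below is only an illustrative input path
$ tokenize_wiki_file --language fi --src-fname output/fi/raw.txt --tgt-fname output/fi/wiki.txt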

Finally, to fall back to multilingual tokenizer/sentencizer models (instead of language-specific ones), pass the flag --allow-multilingual when calling these scripts.
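
For example, the Finnish Wiki40B command from above with the multilingual fallback enabled:

$ tokenize_wiki_40b --language fi --tgt-dir output/fi/ --break-text-mode document --allow-multilingual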

Development setup

Create a conda environment:

$ conda env create -f environment.yml
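
Then activate it. The environment name wiki-tokenizer below is an assumption; check the name: field in environment.yml for the actual value:

$ conda activate wiki-tokenizer  # environment name assumed; see environment.yml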

Then install the lib in editable mode:

$ pip install --editable .
