Code to download and tokenize Wikipedia data.
You can install wikitokenizer directly from PyPI:
$ pip install wikitokenizer
Or from source:
$ git clone https://github.com/tpimentelms/wikitokenizer.git
$ cd wikitokenizer
$ pip install --editable .
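To verify the install, the command-line entry points should now be on your PATH; assuming they expose the standard --help flag, you can check with:
$ tokenize_wiki_40b --help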
To download and tokenize Wikipedia data for a specific language in Wiki40B:
$ tokenize_wiki_40b --language <wikipedia_language_code> --tgt-dir <tgt_dir> --break-text-mode <break_text_mode>
where <wikipedia_language_code> is the Wikipedia language code of the desired language, <tgt_dir> is the directory where the data should be saved, and <break_text_mode> is one of 'document', 'paragraph', or 'sentence'. This script produces three files: train.txt, validation.txt, and test.txt. To tokenize Finnish data, for example, run:
$ tokenize_wiki_40b --language fi --tgt-dir output/fi/ --break-text-mode document
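Assuming the run completes successfully, the target directory will contain the three split files:
$ ls output/fi/
test.txt  train.txt  validation.txt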
To tokenize a previously downloaded file, run:
$ tokenize_wiki_file --language fi --src-fname <src_fname> --tgt-fname output/fi/wiki.txt
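For example, with a raw text file at data/fi_dump.txt (the path is purely illustrative):
$ tokenize_wiki_file --language fi --src-fname data/fi_dump.txt --tgt-fname output/fi/wiki.txt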
Finally, to fall back to multilingual tokenizer / sentencizer models (instead of language-specific ones), pass the --allow-multilingual flag when calling these scripts.
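For example, to process a language not covered by the language-specific models (the code xx below is purely illustrative):
$ tokenize_wiki_40b --language xx --tgt-dir output/xx/ --break-text-mode document --allow-multilingual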
Alternatively, create a conda environment:
$ conda env create -f environment.yml
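Activate the environment before installing; the environment name is defined in environment.yml (wikitokenizer below is an assumption):
$ conda activate wikitokenizer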
Then install the lib in editable mode:
$ pip install --editable .