Commit

Simplify codes

singletongue committed Jun 5, 2019
1 parent 3ed7d97 commit 282d8ab
Showing 6 changed files with 297 additions and 310 deletions.
98 changes: 44 additions & 54 deletions README.md
@@ -26,98 +26,88 @@ For `entity_vectors.txt`, white spaces within names of NEs are replaced with underscores.

Pre-trained vectors are trained under the configurations below (see Manual training for details):

### `make_corpus.py`

#### Japanese

We used [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) (v0.0.6) for tokenizing Japanese texts.

|Option |Value |
|:-------------------|:-----------------------------------------------------|
|`--cirrus_file` |path to `jawiki-20190520-cirrussearch-content.json.gz`|
|`--output_file` |path to the output file |
|`--tokenizer` |`mecab` |
|`--tokenizer_option`|`-d <directory of mecab-ipadic-NEologd dictionary>` |
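
Putting the rows above together, the pre-trained Japanese corpus corresponds to an invocation along the lines of the sketch below; the output file name and the NEologd dictionary path are placeholders that depend on your environment.

```
$ python make_corpus.py \
    --cirrus_file jawiki-20190520-cirrussearch-content.json.gz \
    --output_file corpus.txt \
    --tokenizer mecab \
    --tokenizer_option '-d /usr/lib/mecab/dic/mecab-ipadic-neologd'
```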

### `train.py`

|Option |Value |
|:--------------|:-----------------------------------------------|
|`--corpus_file`|path to a corpus file made with `make_corpus.py`|
|`--output_dir` |path to the output directory |
|`--embed_size` |`100`, `200`, or `300` |
|`--window_size`|`10` |
|`--sample_size`|`10` |
|`--min_count` |`10` |
|`--epoch` |`5` |
|`--workers` |`20` |
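
Likewise, the pre-trained vectors correspond to a command roughly like the following; the corpus and output paths are placeholders, and `--embed_size` takes one of the three sizes listed above.

```
$ python train.py \
    --corpus_file corpus.txt \
    --output_dir output/ \
    --embed_size 300 \
    --window_size 10 \
    --sample_size 10 \
    --min_count 10 \
    --epoch 5 \
    --workers 20
```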


## Manual training

You can manually process a Wikipedia dump file and train a skip-gram model on the preprocessed file.


### Requirements

- Python 3.6
- gensim
- logzero
- MeCab and its Python binding (mecab-python3) (optional: required for tokenizing Japanese texts)


### Steps

1. Download Wikipedia Cirrussearch dump file from [here](https://dumps.wikimedia.org/other/cirrussearch/).
- Make sure to choose a file named like `**wiki-YYYYMMDD-cirrussearch-content.json.gz`.
2. Clone this repository.
3. Preprocess the downloaded dump file.
```
$ python make_corpus.py --cirrus_file <dump file> --output_file <corpus file>
```
If you are processing the Japanese version of Wikipedia, make sure to use the MeCab tokenizer by setting the `--tokenizer mecab` option (together with `--tokenizer_option` to specify a dictionary such as mecab-ipadic-NEologd).
Otherwise, the text will be tokenized by a simple rule based on regular expressions.
4. Train the model.
```
$ python train.py --corpus_file <corpus file> --output_dir <output directory>
```

You can configure the options below for training a model.

```
usage: train.py [-h] --corpus_file CORPUS_FILE --output_dir OUTPUT_DIR
[--embed_size EMBED_SIZE] [--window_size WINDOW_SIZE]
[--sample_size SAMPLE_SIZE] [--min_count MIN_COUNT]
[--epoch EPOCH] [--workers WORKERS]
optional arguments:
-h, --help show this help message and exit
--corpus_file CORPUS_FILE
Corpus file (.txt)
--output_dir OUTPUT_DIR
Output directory to save embedding files
--embed_size EMBED_SIZE
Dimensionality of the word/entity vectors [100]
--window_size WINDOW_SIZE
Maximum distance between the current and predicted
word within a sentence [5]
--sample_size SAMPLE_SIZE
Number of negative samples [5]
--min_count MIN_COUNT
Ignores all words/entities with total frequency lower
than this [5]
--epoch EPOCH number of training epochs [5]
--workers WORKERS Use these many worker threads to train the model [2]
```
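
`train.py` itself is not shown in this commit, but with the gensim 3.x API the options above map naturally onto a skip-gram model trained with negative sampling. The snippet below is only a sketch of that mapping under those assumptions; the file names are placeholders, not paths used by the script.

```
# Minimal sketch, assuming gensim 3.x parameter names and a corpus with one
# tokenized article per line, as produced by make_corpus.py.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('corpus.txt')   # placeholder path
model = Word2Vec(sentences,
                 size=300,      # --embed_size
                 window=10,     # --window_size
                 negative=10,   # --sample_size (number of negative samples)
                 min_count=10,  # --min_count
                 iter=5,        # --epoch
                 workers=20,    # --workers
                 sg=1)          # skip-gram, matching the pre-trained setting

# The trained vectors can then be exported in word2vec text format,
# e.g. for later splitting into word and entity vector files.
model.wv.save_word2vec_format('vectors.txt')
```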


109 changes: 0 additions & 109 deletions generate_corpus.py

This file was deleted.

109 changes: 109 additions & 0 deletions make_corpus.py
@@ -0,0 +1,109 @@
import re
import json
import gzip
import argparse
from collections import OrderedDict

from logzero import logger

from tokenization import RegExpTokenizer, NLTKTokenizer, MeCabTokenizer


regex_spaces = re.compile(r'\s+')
regex_hyperlink = re.compile(r'\[\[([^:]+?)\]\]')
regex_entity = re.compile(r'##[^#]+?##')


def main(args):
logger.info('initializing a tokenizer')
if args.tokenizer == 'regexp':
tokenizer = RegExpTokenizer(do_lower_case=args.do_lower_case,
preserved_pattern=regex_entity)
elif args.tokenizer == 'nltk':
tokenizer = NLTKTokenizer(do_lower_case=args.do_lower_case,
preserved_pattern=regex_entity)
elif args.tokenizer == 'mecab':
tokenizer = MeCabTokenizer(mecab_option=args.tokenizer_option,
do_lower_case=args.do_lower_case,
preserved_pattern=regex_entity)
else:
raise RuntimeError(f'Invalid tokenizer: {args.tokenizer}')

logger.info('generating corpus for training')
n_processed = 0
with gzip.open(args.cirrus_file, 'rt') as fi, \
open(args.output_file, 'wt') as fo:
for line in fi:
json_item = json.loads(line)
if 'title' not in json_item:
continue

title = json_item['title']
text = regex_spaces.sub(' ', json_item['text'])

hyperlinks = dict()
hyperlinks[title] = title
for match in regex_hyperlink.finditer(json_item['source_text']):
if '|' in match.group(1):
(entity, anchor) = match.group(1).split('|', maxsplit=1)
else:
entity = anchor = match.group(1)

if '#' in entity:
entity = entity[:entity.find('#')]

anchor = anchor.strip()
entity = entity.strip()
if len(anchor) > 0 and len(entity) > 0:
hyperlinks.setdefault(anchor, entity)

hyperlinks_sorted = OrderedDict(sorted(
hyperlinks.items(), key=lambda t: len(t[0]), reverse=True))

replacement_flags = [0] * len(text)
for (anchor, entity) in hyperlinks_sorted.items():
cursor = 0
while cursor < len(text) and anchor in text[cursor:]:
start = text.index(anchor, cursor)
end = start + len(anchor)
if not any(replacement_flags[start:end]):
entity_token = f'##{entity}##'.replace(' ', '_')
text = text[:start] + entity_token + text[end:]
replacement_flags = replacement_flags[:start] \
+ [1] * len(entity_token) + replacement_flags[end:]
assert len(text) == len(replacement_flags)
cursor = start + len(entity_token)
else:
cursor = end

text = ' '.join(tokenizer.tokenize(text))

print(text, file=fo)
n_processed += 1

if n_processed <= 10:
logger.info('*** Example ***')
example_text = text[:400] + '...' if len(text) > 400 else text
logger.info(example_text)

if n_processed % 10000 == 0:
logger.info(f'processed: {n_processed}')

if n_processed % 10000 != 0:
logger.info(f'processed: {n_processed}')


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--cirrus_file', type=str, required=True,
help='Wikipedia Cirrussearch content dump file (.json.gz)')
parser.add_argument('--output_file', type=str, required=True,
help='output corpus file (.txt)')
parser.add_argument('--tokenizer', default='regexp',
help='tokenizer type [regexp]')
parser.add_argument('--do_lower_case', action='store_true',
help='lowercase words (not applied to NEs)')
parser.add_argument('--tokenizer_option', type=str, default='',
help='option string passed to the tokenizer')
args = parser.parse_args()
main(args)
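
To make the replacement logic concrete, here is a toy, hypothetical example of what the loop above does: anchors found in `[[entity|anchor]]` hyperlinks of the source text are rewritten in the plain text as `##entity##` tokens, with spaces inside entity names replaced by underscores before tokenization.

```
source_text = 'He was born in [[New York City|New York]].'
text = 'He was born in New York.'

# After the replacement loop, the plain text becomes:
# 'He was born in ##New_York_City##.'
```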
