In [1]:
import sys; sys.path += ['..', '../src']

Here we preprocess the Classics and News corpora:
* split each corpus into individual sentences;
* tokenize each corpus and replace named entities with special tokens;
* measure how much the two vocabularies diverge (ideally they are almost the same);
* learn joint BPEs;
* apply the learnt BPEs.

In [6]:
!cat ../data/classics/*.txt > ../data/generated/classics.txt

**DISCLAIMER.** We run the following cell in a shell rather than in the Jupyter notebook, because for some reason it hangs when run from the notebook :|

In [None]:
# %%bash

# num_threads=20

# ../ext-libs/mosesdecoder/scripts/ems/support/split-sentences.perl -l ru -threads $num_threads \
#     < "../data/generated/classics.txt" > "../data/generated/classics.split"

OK, let's tokenize now.

In [22]:
%%bash

mosesdecoder="../ext-libs/mosesdecoder"
data_dir="../data"
generated_data_dir="$data_dir/generated"

threads=10

# Punctuation normalization is currently disabled. To enable it, pipe the input through
# $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru before the tokenizer.
"$mosesdecoder/scripts/tokenizer/tokenizer.perl" -threads "$threads" -l ru \
    < "$generated_data_dir/classics.split" \
    > "$generated_data_dir/classics.tok"

"$mosesdecoder/scripts/tokenizer/tokenizer.perl" -threads "$threads" -l ru \
    < "$data_dir/news/news.2016.ru.shuffled" \
    > "$generated_data_dir/news.ru.tok"

Tokenizer Version 1.1
Language: ru
Number of threads: 10
Tokenizer Version 1.1
Language: ru
Number of threads: 10


OK, let's extract and replace named entities now.

**DISCLAIMER.** Currently we can't do this because of bugs in deepmipt/ner and because of the low quality of the nltk NER package on Russian text. So let's skip this step for now.

In [1]:
# import ner
# from tqdm import tqdm; tqdm.monitor_interval = 0

# extractor = ner.Extractor(model_url='http://lnsigo.mipt.ru/export/models/ner/ner_model_total_rus.tar.gz')

# min_len = 20
# max_len = 150
# classics = open('../data/generated/classics.tok', encoding='utf-8').read().splitlines()
# news = open('../data/generated/news.ru.tok', encoding='utf-8').read().splitlines()
# classics = [s for s in classics if min_len < len(s.split()) < max_len]
# news = [s for s in news if min_len < len(s.split()) < max_len]

# # We have an awkward bug in ner, which fails on strings like 'Ивано-вичу'
# # https://github.com/deepmipt/ner/issues/9
# classics = [s.replace('-вичу', 'вичу') for s in classics]

# def replace_nes(corpus):
#     for i,s in enumerate(corpus):
#         for m in reversed(list(extractor(s))):
#             s = s[:m.span.start] + '__NE_' + m.type + '__' + s[m.span.end:]

#         corpus[i] = s
        
#         # tqdm hangs the page, so let's print progress manually
#         if (i+1) % 100000 == 0:
#             print('Steps done: {}/{}'.format(i+1, len(corpus)))

# replace_nes(classics)
# replace_nes(news)
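
The right-to-left replacement used in `replace_nes` can be tried out without the `ner` package. The sketch below is only an illustration: `Span` and `replace_spans` are hypothetical names, and `Span` merely mimics the `span.start`, `span.end` and `type` fields that the commented-out cell reads from each match. Replacing from the last span to the first keeps the character offsets of the earlier spans valid, since every edit only changes the text to the right of them.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one NER match: [start, end) offsets plus a type."""
    start: int
    end: int
    type: str


def replace_spans(sentence: str, spans: Iterable[Span]) -> str:
    """Replace every span with a placeholder token such as __NE_PER__."""
    # Process spans from right to left so that earlier offsets stay valid
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        placeholder = '__NE_' + span.type + '__'
        sentence = sentence[:span.start] + placeholder + sentence[span.end:]
    return sentence


example = 'Anna met Boris in Moscow'
spans = [Span(0, 4, 'PER'), Span(9, 14, 'PER'), Span(18, 24, 'LOC')]
print(replace_spans(example, spans))
```

Processing the same spans left to right would shift every later offset by the length difference between the entity and its placeholder, so the remaining replacements would land in the wrong place.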

Let's save the results.

In [2]:
# with open('../data/generated/classics.ner', 'w', encoding='utf-8') as out_f:
#     for line in classics:
#         out_f.write(line + '\n')
        
# with open('../data/generated/news.ru.ner', 'w', encoding='utf-8') as out_f:
#     for line in news:
#         out_f.write(line + '\n')

Let's now see how many tokens the two vocabularies share.

In [2]:
from src.vocab import Vocab

news = open('../data/generated/news.ru.tok', encoding='utf-8').read().splitlines()
classics = open('../data/generated/classics.tok', encoding='utf-8').read().splitlines()

vocab_news = Vocab.from_sequences(news)
vocab_classics = Vocab.from_sequences(classics)

vocab_news = set(vocab_news.token2id.keys())
vocab_classics = set(vocab_classics.token2id.keys())

print('Size of news vocabulary', len(vocab_news))
print('Size of classics vocabulary', len(vocab_classics))
print('How many tokens intersect?', len(vocab_news.intersection(vocab_classics)))

Size of news vocabulary 1326568
Size of classics vocabulary 871402
How many tokens intersect? 374130
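
The raw counts are easier to interpret as ratios. The following is a minimal sketch, not part of the original pipeline: `overlap_stats` and `token_coverage` are hypothetical helpers, and `token_coverage` assumes that tokens are separated by whitespace, as in the Moses-tokenized `.tok` files. The first function works on vocabulary sizes only (types), while the second weights every token by its frequency, which is usually the more telling number because rare types dominate large vocabularies.

```python
from collections import Counter


def overlap_stats(size_a, size_b, size_common):
    """Type-level overlap between two vocabularies, computed from their sizes."""
    union = size_a + size_b - size_common
    return {
        'jaccard': size_common / union,        # shared types / all types
        'coverage_a': size_common / size_a,    # share of A's types also found in B
        'coverage_b': size_common / size_b,    # share of B's types also found in A
    }


def token_coverage(sentences, other_vocab):
    """Fraction of running tokens in `sentences` that occur in `other_vocab`."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    covered = sum(c for tok, c in counts.items() if tok in other_vocab)
    return covered / total if total else 0.0


# Sizes taken from the output above: news, classics, intersection
stats = overlap_stats(1326568, 871402, 374130)
print({name: round(value, 3) for name, value in stats.items()})
```

With the sizes printed above, about 28% of the news types and about 43% of the classics types are shared, and the Jaccard index is about 0.21. Calling `token_coverage(news, vocab_classics)` on the lists loaded in the previous cell would give the frequency-weighted counterpart.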


Now we can learn and apply BPEs.

In [3]:
%%bash

subword_nmt="../ext-libs/subword-nmt"
data_dir="../data"
generated_data_dir="$data_dir/generated"
data_src="$generated_data_dir/news.ru.tok"
data_trg="$generated_data_dir/classics.tok"

# We deliberately use a small number of BPE merges
# so that the model behaves more like a char-rnn
num_bpes=1000

bpes="$generated_data_dir/news-classics.bpes"
# Vocabulary files are written in the same order as the --input files
vocab_src="$generated_data_dir/news.ru.vocab"
vocab_trg="$generated_data_dir/classics.vocab"

# Learn joint BPEs on both corpora
python "$subword_nmt/learn_joint_bpe_and_vocab.py" --input "$data_src" "$data_trg" \
    -s "$num_bpes" -o "$bpes" --write-vocabulary "$vocab_src" "$vocab_trg"

# Apply the learnt BPEs to both tokenized files
python "$subword_nmt/apply_bpe.py" -c "$bpes" < "$data_src" > "$data_src.bpe"
python "$subword_nmt/apply_bpe.py" -c "$bpes" < "$data_trg" > "$data_trg.bpe"
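
To see why a small number of merges keeps the model close to character level, here is a toy version of BPE learning in plain Python. It is a simplified sketch of the general algorithm, not the subword-nmt implementation: the function names are hypothetical, ties are broken in an arbitrary but deterministic way, and none of the library's optimizations or options are reproduced. Each merge joins the most frequent adjacent pair of symbols, so with zero merges every word stays a sequence of characters, and each extra merge adds one longer unit.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Join every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word starts as characters plus an end-of-word marker
    vocab = Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an arbitrary choice)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply the learnt merges, in order, to a single word."""
    symbols = tuple(word) + ('</w>',)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe(['abab', 'abab', 'ab'], num_merges=2)
print(segment('abab', []))      # no merges: pure characters
print(segment('abab', merges))  # two merges: longer units appear
```

With the 1000 merges used in the cell above, only the most frequent character sequences become single units, so most words are still split into short pieces.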