In [4]:
import sys; sys.path += ['..', '../src']

Here we preprocess the Dostoevsky and News corpora:
* split each one into individual sentences;
* tokenize each one;
* measure how much the vocabularies diverge (I really hope that they are almost the same);
* learn joint BPEs;
* apply the BPEs;
* learn strong embeddings for these BPEs on the joint shuffled corpus.

Let's split Dostoevsky into sentences (the sentences there can be very long)

In [24]:
# DISCLAIMER
# We ran this command in a shell rather than in the Jupyter notebook,
# because the notebook does not show intermediate output

# %%bash

# mosesdecoder="../ext-libs/mosesdecoder"
# mkdir -p "../data/generated/classics"

# for file in $(ls ../data/classics); do
# echo "$file -> ${file::-4}.split.txt" && \
# $mosesdecoder/scripts/ems/support/split-sentences.perl -l ru -threads 20 \
#     < "../data/classics/$file" > "../data/generated/classics/${file::-4}.split.txt"
# done

Sentence Splitter v3
Language: ru


Let's tokenize our datasets

In [1]:
# Use `>` rather than `>>` so that re-running this cell does not duplicate the corpus
!cat ../data/generated/classics/*.txt > ../data/generated/classics.txt

In [2]:
%%bash

mosesdecoder="../ext-libs/mosesdecoder"
data_dir="../data"
generated_data_dir="$data_dir/generated"

threads=20

cat "$generated_data_dir/classics.txt" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/classics.tok
    
cat "$data_dir/news.2016.ru.shuffled" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/news.ru.tok

Tokenizer Version 1.1
Language: ru
Number of threads: 20


Let's now see how many tokens the two vocabularies share

In [5]:
from src.vocab import Vocab

# Read both tokenized corpora line by line, closing the files afterwards
with open('../data/generated/news.ru.tok', encoding='utf-8') as f:
    news = f.read().splitlines()
with open('../data/generated/classics.tok', encoding='utf-8') as f:
    classics = f.read().splitlines()

vocab_news = Vocab.from_sequences(news)
vocab_classics = Vocab.from_sequences(classics)

vocab_news = set(vocab_news.token2id.keys())
vocab_classics = set(vocab_classics.token2id.keys())

print('Size of news vocabulary', len(vocab_news))
print('Size of classics vocabulary', len(vocab_classics))
print('How many tokens intersect?', len(vocab_news.intersection(vocab_classics)))

Size of news vocabulary 1326362
Size of classics vocabulary 873226
How many tokens intersect? 374421
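
To put these counts in perspective, here is a small sketch that turns them into ratios. The helper name `overlap_stats` is made up for illustration; the three numbers are the ones printed above. Roughly 28% of the news token types and 43% of the classics token types are shared, so the vocabularies are far from "almost the same". Note that this is a comparison of token types only; it says nothing about how much of the running text the shared tokens cover, which may well be higher.

```python
def overlap_stats(size_a, size_b, size_common):
    """Return (shared share of A, shared share of B, Jaccard index) for two vocabularies."""
    # Size of the union via inclusion-exclusion
    union = size_a + size_b - size_common
    return size_common / size_a, size_common / size_b, size_common / union


# Counts taken from the output of the previous cell
news_share, classics_share, jaccard = overlap_stats(1326362, 873226, 374421)

print(f'Shared part of the news vocabulary: {news_share:.1%}')
print(f'Shared part of the classics vocabulary: {classics_share:.1%}')
print(f'Jaccard index of the two vocabularies: {jaccard:.1%}')
```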


Now we can learn and apply BPEs.

In [8]:
%%bash

subword_nmt="../ext-libs/subword-nmt"
data_dir="../data"
generated_data_dir="$data_dir/generated"
data_src="$generated_data_dir/news.ru.tok"
data_trg="$generated_data_dir/classics.tok"

# We deliberately use a small number of BPE merges
# so that our model behaves more like a char-rnn
num_bpes=1000

bpes="$generated_data_dir/news-classics.bpes"
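# --write-vocabulary expects files in the same order as --input:
# news (src) first, then classics (trg)
vocab_src="$generated_data_dir/news.ru.vocab"
vocab_trg="$generated_data_dir/classics.vocab"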

# Learning BPEs
python "$subword_nmt/learn_joint_bpe_and_vocab.py" --input $data_src $data_trg \
    -s $num_bpes -o $bpes --write-vocabulary $vocab_src $vocab_trg

# Apply the learned BPE codes to both tokenized files
python "$subword_nmt/apply_bpe.py" -c $bpes < $data_src > $data_src.bpe
python "$subword_nmt/apply_bpe.py" -c $bpes < $data_trg > $data_trg.bpe
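
As a rough illustration of what learning BPEs means, here is a toy sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair, repeat. This is not the subword-nmt implementation. The function name `learn_bpe`, the tie-breaking rule and the toy words are assumptions made for the example, and the real script works on word frequencies from the whole corpus.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Toy BPE learner: return the list of merged symbol pairs, in order."""
    # Represent every word as a tuple of characters plus an end-of-word marker
    vocab = Counter(tuple(word) + ('</w>',) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically (an assumption)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into a single symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


toy_merges = learn_bpe(['ab', 'ab', 'abc'], num_merges=2)
print(toy_merges)
```

With only 1000 merges, as configured above, most words stay split into short pieces, which is why the resulting model is closer to a character-level one.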