In [6]:
import sys; sys.path += ['..', '../src']

Here we preprocess the Dostoevsky and News corpora:
* split each corpus into individual sentences;
* tokenize each corpus;
* measure how much the two vocabularies diverge (ideally they are almost the same);
* learn joint BPEs;
* apply BPE;
* learn strong embeddings for the BPE tokens on a joint shuffled corpus.

Let's split the Dostoevsky texts into sentences (some of them are very long).

In [24]:
# NOTE
# We ran this command in a shell rather than in the Jupyter notebook,
# because the notebook does not show intermediate output.

# %%bash

# mosesdecoder="../ext-libs/mosesdecoder"
# mkdir -p "../data/generated/classic-books"

# for file in $(ls ../data/classic-books); do

# echo "$file -> ${file::-4}.split.txt" && \
# $mosesdecoder/scripts/ems/support/split-sentences.perl -l ru -threads 20 \
#     < ../data/classic-books/$file > ../data/generated/classic-books/${file::-4}.split.txt

# done

Sentence Splitter v3
Language: ru
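
Since the motivation for splitting is that some sentences are very long, a quick length check on the split files can be useful. The snippet below is a minimal sketch, not part of the original pipeline: `sentence_length_stats` is a hypothetical helper, and the toy `sample` list stands in for the lines of a `*.split.txt` file. It counts whitespace-separated tokens only, so the numbers are approximate before proper tokenization.

```python
from statistics import median


def sentence_length_stats(lines):
    """Return (count, median, max) of whitespace-token counts over non-empty lines."""
    lengths = [len(line.split()) for line in lines if line.strip()]
    if not lengths:
        return 0, 0, 0
    return len(lengths), median(lengths), max(lengths)


# Toy input standing in for a *.split.txt file (one sentence per line).
sample = [
    "Это короткое предложение .",
    "",
    "А это предложение немного длиннее , чем предыдущее .",
]

count, med, longest = sentence_length_stats(sample)
print(f"sentences={count} median_tokens={med} max_tokens={longest}")
```

On real data the same function could be fed an open file object, since iterating over a file yields its lines.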


Let's tokenize our datasets

In [25]:
# Use > rather than >> so that re-running this cell does not duplicate the corpus
!cat ../data/generated/classic-books/*.txt > ../data/generated/classics.txt

In [26]:
%%bash

mosesdecoder="../ext-libs/mosesdecoder"
data_dir="../data"
generated_data_dir="$data_dir/generated"

threads=20

cat "$generated_data_dir/classics.txt" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/classics.tok
    
cat "$data_dir/news.2016.ru.shuffled" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/news.ru.tok

Tokenizer Version 1.1
Language: ru
Number of threads: 20


Let's now see how many tokens the two vocabularies share.

In [28]:
from src.vocab import Vocab

# Read both tokenized corpora into memory, one sentence per list element
with open('../data/generated/news.ru.tok', encoding='utf-8') as f:
    news = f.read().splitlines()
with open('../data/generated/classics.tok', encoding='utf-8') as f:
    classics = f.read().splitlines()

vocab_news = Vocab.from_sequences(news)
vocab_classics = Vocab.from_sequences(classics)

vocab_news = set(vocab_news.token2id.keys())
vocab_classics = set(vocab_classics.token2id.keys())

print('Size of news vocabulary', len(vocab_news))
print('Size of classics vocabulary', len(vocab_classics))
print('How many tokens intersect?', len(vocab_news.intersection(vocab_classics)))

Size of news vocabulary 1326362
Size of classics vocabulary 778191
How many tokens intersect? 360441
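
The raw intersection size is easier to interpret as a ratio. The sketch below is an illustrative addition: `overlap_report` is a hypothetical helper that works on any two token collections, and the second half simply plugs in the three counts printed above. By these counts, roughly 27% of the news vocabulary appears in the classics, and roughly 46% of the classics vocabulary appears in the news.

```python
def overlap_report(vocab_a, vocab_b):
    """Summarize how much two vocabularies overlap."""
    a, b = set(vocab_a), set(vocab_b)
    shared = len(a & b)
    union = len(a | b)
    return {
        "shared": shared,
        # Fraction of each vocabulary that also occurs in the other one
        "coverage_a": shared / len(a) if a else 0.0,
        "coverage_b": shared / len(b) if b else 0.0,
        # Intersection over union
        "jaccard": shared / union if union else 0.0,
    }


# The same ratios computed directly from the counts printed above
n_news, n_classics, n_shared = 1326362, 778191, 360441
print(f"share of news vocab found in classics: {n_shared / n_news:.1%}")
print(f"share of classics vocab found in news: {n_shared / n_classics:.1%}")
print(f"Jaccard index: {n_shared / (n_news + n_classics - n_shared):.1%}")
```

These are type-level ratios: they ignore token frequency, so rare words and tokenization noise weigh as much as common words.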


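Now we can learn and apply BPEs. Note that the cell below learns a separate merge table (4000 merge operations) for each corpus rather than a single joint one.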

In [29]:
%%bash

subword_nmt="../ext-libs/subword-nmt"
data_dir="../data"
generated_data_dir="$data_dir/generated"
data_src="$generated_data_dir/news.ru.tok"
data_trg="$generated_data_dir/classics.tok"
num_bpes_src=4000
num_bpes_trg=4000

bpes_src="$generated_data_dir/news.ru.bpes"
bpes_trg="$generated_data_dir/classics.bpes"

python "$subword_nmt/learn_bpe.py" -s $num_bpes_src < $data_src > $bpes_src
python "$subword_nmt/learn_bpe.py" -s $num_bpes_trg < $data_trg > $bpes_trg

# Let's apply bpe here for our tokenized files
python "$subword_nmt/apply_bpe.py" -c $bpes_src < $data_src > $data_src.bpe
python "$subword_nmt/apply_bpe.py" -c $bpes_trg < $data_trg > $data_trg.bpe

vocab_src="$generated_data_dir/news.ru.vocab"
vocab_trg="$generated_data_dir/classics.vocab"

# Finally, build the vocabularies from the BPE-segmented text (not from the merge tables)
python "$subword_nmt/get_vocab.py" < "$data_src.bpe" > "$vocab_src"
python "$subword_nmt/get_vocab.py" < "$data_trg.bpe" > "$vocab_trg"
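
For readers unfamiliar with what `learn_bpe.py` computes, here is a toy sketch of the BPE learning loop: repeatedly find the most frequent adjacent symbol pair and merge it. This is a simplified illustration, not subword-nmt's actual implementation; the function names and the toy word counts are assumptions made for the example. The cell above learns one merge table per corpus; the last lines sketch how a joint table could instead be learned from combined word counts.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every standalone occurrence of `pair` with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Return the list of learned merges, most frequent pair first."""
    # Words are stored as space-separated symbols with an end-of-word marker
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


# Toy word counts (hypothetical); real counts would come from the .tok files
toy_counts = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})
print(learn_bpe(toy_counts, 3))

# A joint merge table: add up the word counts of both corpora before learning
news_counts = Counter({"новость": 4, "новый": 3})
classics_counts = Counter({"новый": 2, "князь": 5})
print(learn_bpe(news_counts + classics_counts, 5))
```

With raw counts the larger corpus dominates a joint table, so in practice one might down-weight or subsample the news counts first.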

Strictly speaking, learning embeddings this way is questionable, because the corpora are very unbalanced: about 7M lines (1.4 GB) of news versus only about 100k lines (20 MB) of Dostoevsky. Let's try it anyway.
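
One way to reduce the imbalance when building a joint shuffled corpus is to subsample the larger corpus down to the size of the smaller one. The sketch below is an assumption-laden illustration, not what this notebook does: `balanced_joint_corpus` is a hypothetical helper, and it keeps everything in memory, which would need to be replaced by streaming or reservoir sampling for a 1.4 GB file.

```python
import random


def balanced_joint_corpus(corpus_a, corpus_b, seed=0):
    """Subsample both corpora to the size of the smaller one, then shuffle them together."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    k = min(len(corpus_a), len(corpus_b))
    joint = rng.sample(corpus_a, k) + rng.sample(corpus_b, k)
    rng.shuffle(joint)
    return joint


# Toy stand-ins for the tokenized news and classics lines
news_lines = [f"news sentence {i}" for i in range(1000)]
classics_lines = [f"classics sentence {i}" for i in range(10)]

joint = balanced_joint_corpus(news_lines, classics_lines)
print(len(joint), joint[:3])
```

Subsampling discards most of the news data; oversampling the smaller corpus is the alternative trade-off.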

In [31]:
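# fasttext.skipgram is the legacy (0.8.x) Python API; newer releases use
# fasttext.train_unsupervised(..., model='skipgram') instead.
import fasttext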

num_threads = 20
dim = 512

model_src = fasttext.skipgram('../data/generated/news.ru.tok.bpe',
                              '../trained_models/news.ru.tok.bpe.skipgram',
                              dim=dim, min_count=1, silent=0, thread=num_threads)

model_trg = fasttext.skipgram('../data/generated/classics.tok.bpe',
                              '../trained_models/classics.tok.bpe.skipgram',
                              dim=dim, min_count=1, silent=0, thread=num_threads)

model_trg_tok = fasttext.skipgram('../data/generated/classics.tok',
                                  '../trained_models/classics.tok.skipgram',
                                  dim=dim, min_count=5, silent=0, thread=num_threads)
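
To inspect the trained embeddings without the fasttext package, the textual `.vec` output can be parsed directly. The sketch below assumes the usual fastText text format (a header line with the word count and dimension, then one word and its vector per line) and that each training call above wrote a `.vec` file next to its `.bin` model; `read_vec` and `cosine` are hypothetical helpers, and the toy lines stand in for a real file.

```python
import math


def read_vec(lines):
    """Parse fastText-style .vec lines into (dim, {token: vector})."""
    it = iter(lines)
    _n_words, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy .vec content (hypothetical values); a real file would be opened with
# open(path, encoding='utf-8') and passed in the same way.
toy_vec = ["2 3", "я 1.0 0.0 0.0 ", "ты 1.0 1.0 0.0 "]
dim, vectors = read_vec(toy_vec)
print(dim, cosine(vectors["я"], vectors["ты"]))
```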