In [1]:
import sys; sys.path += ['..', '../src']

Here we preprocess the Dostoevsky and News corpora:
* split each one into individual sentences;
* tokenize each one;
* measure how much the vocabularies diverge (I really hope that they are almost the same);
* learn joint BPEs;
* apply BPE;
* learn strong embeddings for these BPEs on the joint shuffled corpus.

First, let's join all the Dostoevsky books into a single corpus

In [4]:
import os
from os import path

books = []

for bookname in os.listdir('../data/dostoevsky'):
    # Read each book and close the file handle right away
    with open(path.join('../data/dostoevsky', bookname), 'r', encoding='utf-8') as book_f:
        book = book_f.read().splitlines()
    # Drop empty lines
    book = [line for line in book if len(line) > 0]
    books.append(book)
    
data = [s for book in books for s in book]

with open('../data/generated/dostoevsky_joined.txt', 'w', encoding='utf-8') as f:
    for line in data:
        f.write(line + '\n')

Now let's split the Dostoevsky corpus into sentences (some of them can be very long)

In [5]:
%%bash

dostoevsky_joined="../data/generated/dostoevsky_joined.txt"
dostoevsky_sent_split="../data/generated/dostoevsky_sent_split.txt"
mosesdecoder="../ext-libs/mosesdecoder"

$mosesdecoder/scripts/ems/support/split-sentences.perl -l ru \
    < $dostoevsky_joined > $dostoevsky_sent_split

# Let's see how many more lines we got after sentence splitting
wc -l "$dostoevsky_joined"
wc -l "$dostoevsky_sent_split"

Sentence Splitter v3
Language: ru
bash: line 10: !wc: command not found
bash: line 11: !wc: command not found


Let's tokenize our datasets

In [7]:
%%bash

mosesdecoder="../ext-libs/mosesdecoder"
data_dir="../data"
generated_data_dir="$data_dir/generated"

threads=6

cat "$generated_data_dir/dostoevsky_sent_split.txt" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/dostoevsky.tok
    
cat "$data_dir/news.2016.ru.shuffled" | \
    $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru | \
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -threads $threads -l ru > \
    $generated_data_dir/news.ru.tok

Tokenizer Version 1.1
Language: ru
Number of threads: 6
Tokenizer Version 1.1
Language: ru
Number of threads: 6


Let's now see how many tokens the two vocabularies share

In [2]:
from src.vocab import Vocab

# Load both tokenized corpora, one sentence per line
with open('../data/generated/news.ru.tok', encoding='utf-8') as f:
    news = f.read().splitlines()
with open('../data/generated/dostoevsky.tok', encoding='utf-8') as f:
    dostoevsky = f.read().splitlines()

vocab_news = Vocab.from_sequences(news)
vocab_dostoevsky = Vocab.from_sequences(dostoevsky)

vocab_news = set(vocab_news.token2id.keys())
vocab_dostoevsky = set(vocab_dostoevsky.token2id.keys())

print('Size of news vocabulary', len(vocab_news))
print('Size of Dostoevsky vocabulary', len(vocab_dostoevsky))
print('How much tokens intersect?', len(vocab_news.intersection(vocab_dostoevsky)))

Size of news vocabulary 1326362
Size of Dostoevsky vocabulary 128977
How much tokens intersect? 94632
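
The raw counts above are easier to read as ratios. Here is a minimal sketch that turns the three numbers into overlap shares; the helper name `overlap_from_counts` is made up for illustration and is not part of `src.vocab`.

```python
def overlap_from_counts(size_a, size_b, shared):
    """Turn two vocabulary sizes and their intersection size into overlap ratios."""
    union = size_a + size_b - shared  # inclusion-exclusion
    return {
        'share_of_a': shared / size_a,  # fraction of vocabulary A found in B
        'share_of_b': shared / size_b,  # fraction of vocabulary B found in A
        'jaccard': shared / union,      # intersection over union
    }


# Counts printed by the cell above: news, Dostoevsky, intersection
stats = overlap_from_counts(1326362, 128977, 94632)
for name, value in stats.items():
    print(name, round(value, 4))
```

With these counts, roughly 73% of the Dostoevsky vocabulary also occurs in the news corpus, while the shared tokens make up only about 7% of the news vocabulary, so the two dictionaries are far from "almost the same".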


Before we learn a joint BPE, we should first join the two corpora

In [3]:
import random

joint = news + dostoevsky
random.shuffle(joint)

with open('../data/generated/dostoevsky-news.tok', 'w', encoding='utf-8') as out_f:
    for line in joint:
        out_f.write(line + '\n')

Argh, now we can learn and apply BPEs.

In [4]:
%%bash

subword_nmt="../ext-libs/subword-nmt"
data_dir="../data"
generated_data_dir="$data_dir/generated"
dataset="$generated_data_dir/dostoevsky-news.tok"
num_bpes=10000

# bpes="../data/generated/bpes"
bpes="$generated_data_dir/dostoevsky-news.bpes"
vocab="$generated_data_dir/dostoevsky-news.vocab"
python "$subword_nmt/learn_bpe.py" -s $num_bpes < $dataset > $bpes

# Let's apply bpe here for our tokenized files
for file in "news.ru.tok" "dostoevsky.tok" "dostoevsky-news.tok"
do
    python "$subword_nmt/apply_bpe.py" -c $bpes \
        < $generated_data_dir/$file > $generated_data_dir/$file.bpe
done

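# And finally, we should generate the vocab. get_vocab.py counts tokens in text,
# so we feed it the BPE-encoded joint corpus rather than the codes file
python "$subword_nmt/get_vocab.py" < "$dataset.bpe" > "$vocab"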
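
To make it clearer what the BPE learning step does, here is a toy sketch of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is only an illustration, not the code of `learn_bpe.py`; the function name `learn_bpe_merges`, the `</w>` end-of-word marker and the tie-breaking rule are assumptions made for this example, and the real script is optimized and may break ties differently.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge operations from a list of words (toy version)."""
    # Represent each word as a tuple of symbols plus an end-of-word marker
    vocab = Counter(tuple(word) + ('</w>',) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (assumption)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(['low', 'low', 'low', 'lower'], 2))
```

In the cell above, `-s $num_bpes` plays the role of `num_merges`, so we ask for 10000 merge operations learned on the joint corpus.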

Well, strictly speaking, learning embeddings this way is questionable, because the corpora are heavily imbalanced: we have about 7M lines (1.4 GB) of News and only about 100k lines (20 MB) of Dostoevsky, so the embeddings will be dominated by the news data. But let's try it this way anyway.
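
For reference, one possible way to soften the imbalance (not used in this notebook) would be to repeat the smaller corpus before shuffling. The sketch below is only an assumption about how that could look; `oversample_and_shuffle` is a hypothetical helper, and full balancing like this may well be too aggressive in practice.

```python
import random


def oversample_and_shuffle(large, small, seed=0):
    """Repeat the small corpus so it roughly matches the large one, then shuffle."""
    # Integer repetition factor, e.g. about 70 for 7M vs 100k lines
    factor = max(1, len(large) // max(1, len(small)))
    joint = list(large) + list(small) * factor
    # A seeded generator keeps the shuffle reproducible
    random.Random(seed).shuffle(joint)
    return joint, factor


# Tiny stand-in corpora, just to show the shape of the result
joint_demo, factor_demo = oversample_and_shuffle(['news line'] * 7, ['dostoevsky line'])
print(factor_demo, len(joint_demo))
```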

In [5]:
import fasttext

num_threads = 6

model_trg = fasttext.skipgram('../data/generated/dostoevsky-news.tok.bpe',
                              '../trained_models/dostoevsky-news.tok.bpe.skipgram',
                              dim=512, min_count=1, silent=0, thread=num_threads)