# Ukrainian Stories for Kids Generation Based on Multilingual BERT

The goal of this final project was to further train Google's multilingual BERT on a Ukrainian corpus, using the Masked Language Model and Next Sentence Prediction objectives combined, and then to compare the original model with the trained version in order to see how good the original model was and whether it could be improved.

One of the biggest challenges in this project was finding a suitable dataset. Since Ukrainian corpora are not widely available, it was necessary to create one. The initial guess was that although BERT claims to be multilingual, it does not perform well on low-resource languages like Ukrainian. This assumption proved to be true, as you will see later. Short stories for kids and fairytales are good candidates for a training corpus, since they use a fairly small vocabulary and generally share a similar narrative structure. A child's vocabulary is not as developed as that of an adult, so the model might do a much better job training on such texts.

In [1]:
!pip install pytorch-pretrained-bert pytorch-nlp
!pip install tokenize_uk
!pip install pytorch_transformers

Collecting pytorch-pretrained-bert

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification, BertForNextSentencePrediction, BertForPreTraining
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
import tokenize_uk
import os
%matplotlib inline

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != "/device:GPU:0":
  raise SystemError("GPU Device not found")
print("Found GPU at: {}".format(device_name))

Found GPU at: /device:GPU:0


In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla K80'

# Dataset

The dataset for this project consists of fairytales and stories for kids. The data was downloaded from different websites and resulted in nearly 750 documents with about 93,000 sentences for training.

In [0]:
# Loads all texts found in the sub-directories of the given root path
def load_texts(root_path):
  texts = []
  for _, directories, _ in os.walk(root_path):
    for d in directories:
      print("Directory: ", d)
      # Use separate names here so the function argument is not shadowed
      for dir_path, _, filenames in os.walk(os.path.join(root_path, d)):
        for filename in filenames:
          print("File: ", filename)
          with open(os.path.join(dir_path, filename), 'r') as file:
            data = file.read()
            texts.append(data)
  return texts

In [0]:
root_path = "/content/drive/My Drive/Fairytales/train/"
texts = load_texts(root_path)

Directory:  К
File:  Конотопська відьма_8.txt
File:  Красносвіт.txt
File:  Конотопська відьма_9.txt
File:  Ківш лиха.txt
File:  Калиточка.txt
File:  Кравець та вовк.txt
File:  Кіт і пес.txt
File:  Кіт, кріт, курочка та лисиця.txt
File:  Кобиляча голова.txt
File:  Кирик.txt
File:  Крижане серце.txt
File:  Карликова сопілка.txt
File:  Кіт, цап і баран.txt
File:  Коваль.txt
File:  Козак Мамарига.txt
File:  Кінська сила.txt
File:  Круглячок.txt
File:  Конотопська відьма_12.txt
File:  Конотопська відьма_11.txt
File:  Кривенька качечка.txt
File:  Конотопська відьма_13.txt
File:  Колобок.txt
File:  Конотопська відьма_10.txt
File:  Казка про Женчика.txt
File:  Котофей і пан Печерецький.txt
File:  Калинова сопілка.txt
File:  Кому трудніш правитись.txt
File:  Котигорошко.txt
File:  Коза – дереза.txt
File:  Козаки і смерть.txt
File:  Кабан дикий - хвіст великий_1.txt
File:  Коржик.txt
File:  Кріпак і чорт.txt
File:  Королевич та залізний вовк.txt
File:  Конотопська відьма_14.txt
File:  Конотопськ

In [0]:
print("Number of documents: ", len(texts)) # 743 documents

Number of documents:  743


In [0]:
# Returns unique single-character tokens that are neither letters nor digits
def get_unique_delimiters(texts):
  delimiters = set()
  for text in texts:
    for word in tokenize_uk.tokenize_words(text):
      if len(word) == 1 and not word.isalpha() and not word.isdigit():
        delimiters.add(word)
  return delimiters

In [0]:
unique_delimiters = get_unique_delimiters(texts)
print(unique_delimiters)
print(len(unique_delimiters))

{'«', '<', '|', '-', ',', '"', '№', '_', "'", '&', '—', '[', '“', '?', '―', '‘', '–', '.', ']', '’', ';', ':', '(', ')', '…', '~', '*', '^', '>', '»', '!'}
31


In [0]:
dashes = {'–', '—', '―', '~'} # replace with -
special_symbols = {'№', '_', '<', '>', '|', ']', '*', '[', '^', '&'} # replace with ""
apostrophes = {'’', '‘'} # replace with '
direct_speech = {'“', '»', '«'} # replace with '"'
three_dots = {'…'} # replace with '.'
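
The replacement groups above can also be collapsed into a single translation table. The following is a minimal sketch, not the code used in the project: the helper name `normalize_punctuation` is hypothetical, and the sets are simply copied from the cell above.

```python
# Sets copied from the cell above
dashes = {'–', '—', '―', '~'}
special_symbols = {'№', '_', '<', '>', '|', ']', '*', '[', '^', '&'}
apostrophes = {'’', '‘'}
direct_speech = {'“', '»', '«'}
three_dots = {'…'}

# Build one character-to-replacement mapping for str.translate
replacements = {}
replacements.update({ch: "-" for ch in dashes})
replacements.update({ch: "" for ch in special_symbols})
replacements.update({ch: "'" for ch in apostrophes})
replacements.update({ch: '"' for ch in direct_speech})
replacements.update({ch: "." for ch in three_dots})
TRANSLATION_TABLE = str.maketrans(replacements)


def normalize_punctuation(text):
    """Hypothetical helper: apply every replacement group in one pass."""
    return text.translate(TRANSLATION_TABLE)


print(normalize_punctuation("«Коза – дереза»…"))
```

Because every character is mapped exactly once, a token that contains characters from several groups cannot be emitted more than once.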

In [0]:
counter = 0  # number of tokens that contained an ellipsis character
for i in range(len(texts)):
  print("Processing text: ", i)
  words = []
  for word in tokenize_uk.tokenize_words(texts[i]):
    # Apply every replacement group to the same token, so that a token
    # matching several groups is still appended exactly once
    for dash in dashes:
      word = word.replace(dash, "-")
    for special_symbol in special_symbols:
      word = word.replace(special_symbol, "")
    for apostrophe in apostrophes:
      word = word.replace(apostrophe, "'")
    for direct in direct_speech:
      word = word.replace(direct, '"')
    for dots in three_dots:
      if dots in word:
        counter += 1
        word = word.replace(dots, '.')
    words.append(word)
  texts[i] = " ".join(words)

Processing text:  0
Processing text:  1
Processing text:  2
Processing text:  3
Processing text:  4
Processing text:  5
Processing text:  6
Processing text:  7
Processing text:  8
Processing text:  9
Processing text:  10
Processing text:  11
Processing text:  12
Processing text:  13
Processing text:  14
Processing text:  15
Processing text:  16
Processing text:  17
Processing text:  18
Processing text:  19
Processing text:  20
Processing text:  21
Processing text:  22
Processing text:  23
Processing text:  24
Processing text:  25
Processing text:  26
Processing text:  27
Processing text:  28
Processing text:  29
Processing text:  30
Processing text:  31
Processing text:  32
Processing text:  33
Processing text:  34
Processing text:  35
Processing text:  36
Processing text:  37
Processing text:  38
Processing text:  39
Processing text:  40
Processing text:  41
Processing text:  42
Processing text:  43
Processing text:  44
Processing text:  45
Processing text:  46
Processing text:  47
Pr

In [0]:
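new_texts = []
for text in texts:
  # Mark "?", "!" and ":" with an extra "." before sentence splitting
  # (these markers are removed again once the sentences are split)
  text = text.replace("?", "?.")
  text = text.replace("!", "!.")
  text = text.replace(":", ":.")
  # Drop the dash that follows a sentence end
  text = text.replace(". -", ". ")
  new_texts.append(text)
print("Number of texts: ", len(new_texts))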

Number of texts:  743


In [0]:
sentences = []
for text in new_texts:
  sentences += tokenize_uk.tokenize_sents(text)
  sentences.append("\n")  # document separator
# Note: the count below includes one separator entry per document
print("Number of sentences: ", len(sentences))

Number of sentences:  93728


In [0]:
for i in range(len(sentences)):
  sentence = sentences[i]
  sentence = sentence.replace("?.", "?")
  sentence = sentence.replace("!.", "!")
  sentence = sentence.replace(":.", ":")
  sentences[i] = sentence
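
The marker trick used in the cells above (append "." before splitting, strip it afterwards) can be illustrated in isolation. This is only a sketch under a stated assumption: `naive_split` is a hypothetical stand-in that breaks only after a full stop, and `tokenize_uk.tokenize_sents` may behave differently.

```python
import re

MARKS = ("?", "!", ":")


def naive_split(text):
    """Stand-in splitter (assumption): breaks only after '.' followed by whitespace."""
    return [part for part in re.split(r"(?<=\.)\s+", text) if part]


def split_with_markers(text):
    """Add a temporary '.' after each mark, split, then remove the markers again."""
    for mark in MARKS:
        text = text.replace(mark, mark + ".")
    cleaned = []
    for sentence in naive_split(text):
        for mark in MARKS:
            sentence = sentence.replace(mark + ".", mark)
        cleaned.append(sentence)
    return cleaned


example = "Хто там? Це я! Відчиняй."
print(naive_split(example))         # the stand-in splitter alone keeps one piece
print(split_with_markers(example))  # with markers, each sentence is separated
```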

In [0]:
# for sentence in sentences:
#   print(sente

In [0]:
len(sentences) # 93728
# Make it 100 000 ?

93728

In [0]:
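# Track the longest sentence seen so far and truncate overly long sentences.
# Lengths are counted in characters here, not in BERT tokens.
longest = len(sentences[0])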
for i in range(len(sentences)):
  sent = sentences[i]
  if (len(sent) > longest):
    longest = len(sent)
    print(longest)
  if (len(sent) > 512):
    sentences[i] = sent[:500]

329
434
500
507
510


In [0]:
sentences[:100]

['Смутний i невеселий стояв , руки заложивши , хваброї Конотопської сотнi пан сотник , Уласович Микита Забрьоха , у славному сотенному мiстечку Конотопi , на вулицi , бiля шинку , де усегда збиралася сотня чи на муштру , чи на перелiку , що чи не втiк котрий козак часом , бува .',
 'Стоїть вiн , сердека , руки зложивши , голову понуривши , мов вiл перед ярмом ; а козаки начисто , уся сотня , як скло , перед ним стiною стоїть , шапки поскладавши на приспi у шинку , щоб як буде муштра , так щоб не поспадали з голов , а дiтвора , що тут так i бiга круг козацтва , щоб не пiдiбрали та не запроторили куди геть .',
 'Так отто стоять козаки i ждуть , що з ними будуть робити i який приказ буде , та промеж себе дещо i базiкають , мов вода на лотоках шумить , аж луна йде ; та доставши з халяв хто рiжок з кабакою , - та нюхають , та чхають , а хто люльку - та , тут її розпаливши , i смокче .',
 'Пан Забрьоха сього нiчого не вважа , i не бачить , i не чує , що край його дiється .',
 'Йому здається 

In [0]:
print("Corpus length: ", len(sentences))

Corpus length:  93728


In [0]:
with open("/content/drive/My Drive/BERT/corpus.txt", "w") as text_file:
  for sentence in sentences:
    if sentence != "\n":
      text_file.write(sentence + "\n")
    else:
      # A blank line marks the boundary between two documents
      text_file.write("\n")
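
The resulting corpus file has one sentence per line and a blank line between documents. The sketch below shows that layout on a toy example and reads it back into per-document lists. The helper names `write_corpus` and `read_documents` are hypothetical, and an in-memory buffer stands in for the file on Drive.

```python
import io


def write_corpus(sentences, handle):
    """Write sentences line by line; a "\n" entry becomes a blank separator line."""
    for sentence in sentences:
        handle.write("\n" if sentence == "\n" else sentence + "\n")


def read_documents(handle):
    """Group lines into documents, using blank lines as document boundaries."""
    documents, current = [], []
    for line in handle:
        line = line.strip()
        if line == "":
            if current:
                documents.append(current)
            current = []
        else:
            current.append(line)
    if current:
        documents.append(current)
    return documents


sample = ["Жив собі дід.", "Мав він козу.", "\n", "Була собі лисичка.", "\n"]
buffer = io.StringIO()
write_corpus(sample, buffer)
buffer.seek(0)
docs = read_documents(buffer)
print(len(docs), "documents")
```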

# Pregenerate Training Data

Pregenerating the training data was based on an example from the PyTorch-Transformers GitHub repository. Training data was generated for 4 epochs, since the BERT authors suggest 2-4 epochs as the optimal number for fine-tuning.

In [0]:
from argparse import ArgumentParser
from pathlib import Path
from tqdm import tqdm, trange
from tempfile import TemporaryDirectory
import shelve
from multiprocessing import Pool

from random import random, randrange, randint, shuffle, choice
from pytorch_transformers.tokenization_bert import BertTokenizer
import numpy as np
import json
import collections

In [0]:
class DocumentDatabase:
    def __init__(self, reduce_memory=False):
        if reduce_memory:
            self.temp_dir = TemporaryDirectory()
            self.working_dir = Path(self.temp_dir.name)
            self.document_shelf_filepath = self.working_dir / 'shelf.db'
            self.document_shelf = shelve.open(str(self.document_shelf_filepath),
                                              flag='n', protocol=-1)
            self.documents = None
        else:
            self.documents = []
            self.document_shelf = None
            self.document_shelf_filepath = None
            self.temp_dir = None
        self.doc_lengths = []
        self.doc_cumsum = None
        self.cumsum_max = None
        self.reduce_memory = reduce_memory

    def add_document(self, document):
        if not document:
            return
        if self.reduce_memory:
            current_idx = len(self.doc_lengths)
            self.document_shelf[str(current_idx)] = document
        else:
            self.documents.append(document)
        self.doc_lengths.append(len(document))

    def _precalculate_doc_weights(self):
        self.doc_cumsum = np.cumsum(self.doc_lengths)
        self.cumsum_max = self.doc_cumsum[-1]

    def sample_doc(self, current_idx, sentence_weighted=True):
        # Uses the current iteration counter to ensure we don't sample the same doc twice
        if sentence_weighted:
            # With sentence weighting, we sample docs proportionally to their sentence length
            if self.doc_cumsum is None or len(self.doc_cumsum) != len(self.doc_lengths):
                self._precalculate_doc_weights()
            rand_start = self.doc_cumsum[current_idx]
            rand_end = rand_start + self.cumsum_max - self.doc_lengths[current_idx]
            sentence_index = randrange(rand_start, rand_end) % self.cumsum_max
            sampled_doc_index = np.searchsorted(self.doc_cumsum, sentence_index, side='right')
        else:
            # If we don't use sentence weighting, then every doc has an equal chance to be chosen
            sampled_doc_index = (current_idx + randrange(1, len(self.doc_lengths))) % len(self.doc_lengths)
        assert sampled_doc_index != current_idx
        if self.reduce_memory:
            return self.document_shelf[str(sampled_doc_index)]
        else:
            return self.documents[sampled_doc_index]

    def __len__(self):
        return len(self.doc_lengths)

    def __getitem__(self, item):
        if self.reduce_memory:
            return self.document_shelf[str(item)]
        else:
            return self.documents[item]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, traceback):
        if self.document_shelf is not None:
            self.document_shelf.close()
        if self.temp_dir is not None:
            self.temp_dir.cleanup()
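
The sentence-weighted branch of `sample_doc` can be tried out on its own. The sketch below re-implements just that arithmetic with the standard library (`accumulate` and `bisect_right` in place of `np.cumsum` and `np.searchsorted`). The function name `sample_other_doc` and the toy document lengths are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate
from random import Random


def sample_other_doc(doc_lengths, current_idx, rng):
    """Pick a document index other than current_idx, weighted by sentence count."""
    doc_cumsum = list(accumulate(doc_lengths))
    total = doc_cumsum[-1]
    # Start right after the current document and cover every other sentence,
    # wrapping around the end of the corpus with the modulo
    rand_start = doc_cumsum[current_idx]
    rand_end = rand_start + total - doc_lengths[current_idx]
    sentence_index = rng.randrange(rand_start, rand_end) % total
    # Map the global sentence index back to the document that contains it
    return bisect_right(doc_cumsum, sentence_index)


rng = Random(0)
doc_lengths = [2, 3, 5]  # toy corpus: three documents with 2, 3 and 5 sentences
draws = [sample_other_doc(doc_lengths, 1, rng) for _ in range(1000)]
counts = {idx: draws.count(idx) for idx in range(len(doc_lengths))}
print(counts)
```

The current document (index 1 here) is never drawn, and the longer document is drawn more often than the shorter one.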

In [0]:
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
    """Truncates a pair of sequences to a maximum sequence length. Lifted from Google's BERT repo."""
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_num_tokens:
            break

        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1

        # We want to sometimes truncate from the front and sometimes from the
        # back to add more randomness and avoid biases.
        if random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()

In [0]:
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])

In [0]:
def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
    """Creates the predictions for the masked LM objective. This is mostly copied from the Google BERT repo, but
    with several refactors to clean it up and remove a lot of unnecessary variables."""
    cand_indices = []
    for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
            continue
        # Whole Word Masking means that we mask all of the wordpieces
        # corresponding to an original word. When a word has been split into
        # WordPieces, the first token does not have any marker and any subsequent
        # tokens are prefixed with ##. So whenever we see the ## token, we
        # append it to the previous set of word indexes.
        #
        # Note that Whole Word Masking does *not* change the training code
        # at all -- we still predict each WordPiece independently, softmaxed
        # over the entire vocabulary.
        if (whole_word_mask and len(cand_indices) >= 1 and token.startswith("##")):
            cand_indices[-1].append(i)
        else:
            cand_indices.append([i])

    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))
    shuffle(cand_indices)
    masked_lms = []
    covered_indexes = set()
    for index_set in cand_indices:
        if len(masked_lms) >= num_to_mask:
            break
        # If adding a whole-word mask would exceed the maximum number of
        # predictions, then just skip this candidate.
        if len(masked_lms) + len(index_set) > num_to_mask:
            continue
        is_any_index_covered = False
        for index in index_set:
            if index in covered_indexes:
                is_any_index_covered = True
                break
        if is_any_index_covered:
            continue
        for index in index_set:
            covered_indexes.add(index)

            masked_token = None
            # 80% of the time, replace with [MASK]
            if random() < 0.8:
                masked_token = "[MASK]"
            else:
                # 10% of the time, keep original
                if random() < 0.5:
                    masked_token = tokens[index]
                # 10% of the time, replace with random word
                else:
                    masked_token = choice(vocab_list)
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
            tokens[index] = masked_token

    assert len(masked_lms) <= num_to_mask
    masked_lms = sorted(masked_lms, key=lambda x: x.index)
    mask_indices = [p.index for p in masked_lms]
    masked_token_labels = [p.label for p in masked_lms]

    return tokens, mask_indices, masked_token_labels
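
The nested `random()` checks in the function above yield roughly an 80/10/10 split between `[MASK]`, the original token and a random vocabulary entry. A small simulation makes that visible. This is a sketch only: `choose_masked_token` is a hypothetical helper that isolates the replacement rule, and the three-word vocabulary is made up.

```python
from collections import Counter
from random import Random


def choose_masked_token(original, vocab_list, rng):
    """Same replacement rule as above, isolated for a single token."""
    if rng.random() < 0.8:
        return "[MASK]"              # about 80% of the time
    if rng.random() < 0.5:
        return original              # half of the remaining 20%
    return rng.choice(vocab_list)    # the other half of the remaining 20%


rng = Random(42)
vocab_list = ["кіт", "пес", "лисиця"]  # toy vocabulary without the original token
num_trials = 10000
outcomes = Counter()
for _ in range(num_trials):
    token = choose_masked_token("коза", vocab_list, rng)
    if token == "[MASK]":
        outcomes["mask"] += 1
    elif token == "коза":
        outcomes["keep"] += 1
    else:
        outcomes["random"] += 1

fractions = {name: count / num_trials for name, count in outcomes.items()}
print(fractions)
```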

In [0]:
def create_instances_from_document(
        doc_database, doc_idx, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list):
    """This code is mostly a duplicate of the equivalent function from Google BERT's repo.
    However, we make some changes and improvements. Sampling is improved and no longer requires a loop in this function.
    Also, documents are sampled proportionally to the number of sentences they contain, which means each sentence
    (rather than each document) has an equal chance of being sampled as a false example for the NextSentence task."""
    document = doc_database[doc_idx]
    # Account for [CLS], [SEP], [SEP]
    max_num_tokens = max_seq_length - 3

    # We *usually* want to fill up the entire sequence since we are padding
    # to `max_seq_length` anyways, so short sequences are generally wasted
    # computation. However, we *sometimes*
    # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
    # sequences to minimize the mismatch between pre-training and fine-tuning.
    # The `target_seq_length` is just a rough target however, whereas
    # `max_seq_length` is a hard limit.
    target_seq_length = max_num_tokens
    if random() < short_seq_prob:
        target_seq_length = randint(2, max_num_tokens)

    # We DON'T just concatenate all of the tokens from a document into a long
    # sequence and choose an arbitrary split point because this would make the
    # next sentence prediction task too easy. Instead, we split the input into
    # segments "A" and "B" based on the actual "sentences" provided by the user
    # input.
    instances = []
    current_chunk = []
    current_length = 0
    i = 0
    while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        if i == len(document) - 1 or current_length >= target_seq_length:
            if current_chunk:
                # `a_end` is how many segments from `current_chunk` go into the `A`
                # (first) sentence.
                a_end = 1
                if len(current_chunk) >= 2:
                    a_end = randrange(1, len(current_chunk))

                tokens_a = []
                for j in range(a_end):
                    tokens_a.extend(current_chunk[j])

                tokens_b = []

                # Random next
                if len(current_chunk) == 1 or random() < 0.5:
                    is_random_next = True
                    target_b_length = target_seq_length - len(tokens_a)

                    # Sample a random document, with longer docs being sampled more frequently
                    random_document = doc_database.sample_doc(current_idx=doc_idx, sentence_weighted=True)

                    random_start = randrange(0, len(random_document))
                    for j in range(random_start, len(random_document)):
                        tokens_b.extend(random_document[j])
                        if len(tokens_b) >= target_b_length:
                            break
                    # We didn't actually use these segments so we "put them back" so
                    # they don't go to waste.
                    num_unused_segments = len(current_chunk) - a_end
                    i -= num_unused_segments
                # Actual next
                else:
                    is_random_next = False
                    for j in range(a_end, len(current_chunk)):
                        tokens_b.extend(current_chunk[j])
                truncate_seq_pair(tokens_a, tokens_b, max_num_tokens)

                assert len(tokens_a) >= 1
                assert len(tokens_b) >= 1

                tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
                # The segment IDs are 0 for the [CLS] token, the A tokens and the first [SEP]
                # They are 1 for the B tokens and the final [SEP]
                segment_ids = [0 for _ in range(len(tokens_a) + 2)] + [1 for _ in range(len(tokens_b) + 1)]

                tokens, masked_lm_positions, masked_lm_labels = create_masked_lm_predictions(
                    tokens, masked_lm_prob, max_predictions_per_seq, whole_word_mask, vocab_list)

                instance = {
                    "tokens": tokens,
                    "segment_ids": segment_ids,
                    "is_random_next": is_random_next,
                    "masked_lm_positions": masked_lm_positions,
                    "masked_lm_labels": masked_lm_labels}
                instances.append(instance)
            current_chunk = []
            current_length = 0
        i += 1
    return instances
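
How a sentence pair is laid out before masking may be easier to see on a tiny example. The sketch below repeats only the token and segment-id assembly from the function above. The helper name `build_pair` is hypothetical, and the tokens are made up rather than real WordPiece output.

```python
def build_pair(tokens_a, tokens_b):
    """Join two segments with the special tokens and build matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # 0 for [CLS], segment A and the first [SEP]; 1 for segment B and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair(["жив", "собі", "дід"], ["та", "баба"])
for token, segment_id in zip(tokens, segment_ids):
    print(segment_id, token)
```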

In [0]:
def create_training_file(docs, vocab_list, args, epoch_num):
    epoch_filename = args["output_dir"] + "/" + "epoch_{}.json".format(epoch_num)
    num_instances = 0
    with open(epoch_filename, 'w') as epoch_file:
        for doc_idx in trange(len(docs), desc="Document"):
            doc_instances = create_instances_from_document(
                docs, doc_idx, max_seq_length=args["max_seq_len"], short_seq_prob=args["short_seq_prob"],
                masked_lm_prob=args["masked_lm_prob"], max_predictions_per_seq=args["max_predictions_per_seq"],
                whole_word_mask=args["do_whole_word_mask"], vocab_list=vocab_list)
            # print(doc_instances[0])
            doc_instances = [json.dumps(instance) for instance in doc_instances]
            for instance in doc_instances:
                epoch_file.write(instance + '\n')
                num_instances += 1
    metrics_file = args["output_dir"] + "/" + "epoch_{}_metrics.json".format(epoch_num)
    with open(metrics_file, 'w') as metrics_file:
        metrics = {
            "num_training_examples": num_instances,
            "max_seq_len": args["max_seq_len"]
        }
        metrics_file.write(json.dumps(metrics))
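
Each epoch file is therefore a JSON Lines file with one training instance per line. The sketch below writes and re-reads a single instance with the same keys as above. The instance itself is invented for illustration, and an in-memory buffer replaces the real epoch file.

```python
import io
import json

# Invented example instance with the same keys as the instances created above
instance = {
    "tokens": ["[CLS]", "жив", "[MASK]", "дід", "[SEP]", "та", "баба", "[SEP]"],
    "segment_ids": [0, 0, 0, 0, 0, 1, 1, 1],
    "is_random_next": False,
    "masked_lm_positions": [2],
    "masked_lm_labels": ["собі"],
}

epoch_file = io.StringIO()
epoch_file.write(json.dumps(instance) + "\n")  # one JSON object per line

epoch_file.seek(0)
loaded = [json.loads(line) for line in epoch_file]
print(loaded[0]["masked_lm_labels"])
```

By default `json.dumps` escapes non-ASCII characters, so the Cyrillic tokens appear as `\uXXXX` sequences in the file and are restored by `json.loads`.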

In [0]:
train_corpus = "/content/drive/My Drive/BERT/corpus.txt"
output_dir = "/content/drive/My Drive/BERT/training"
bert_model = "bert-base-multilingual-cased"
do_lower_case = False
do_whole_word_mask = False
reduce_memory = True
num_workers = 1
epochs_to_generate = 4
max_seq_len = 512
short_seq_prob = 0.1 # Probability of making a short sentence as a training example
masked_lm_prob = 0.15 # Probability of masking each token for the LM task
max_predictions_per_seq = 20 # Maximum number of tokens to mask in each sequence
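
These two masking settings interact: the number of masked tokens is the rounded 15% share, but never more than `max_predictions_per_seq`. The sketch below repeats the formula from `create_masked_lm_predictions` for a few sequence lengths. The helper name `num_to_mask` is hypothetical, and the defaults are the values set above.

```python
def num_to_mask(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of positions selected for masking in a sequence of num_tokens tokens."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


for num_tokens in (4, 128, 512):
    print(num_tokens, "tokens ->", num_to_mask(num_tokens), "masked")
```

For a full-length sequence of 512 tokens the cap of 20 applies, so only about 4% of the tokens are masked rather than 15%.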

In [0]:
args = {
    "train_corpus": train_corpus,
    "output_dir": output_dir,
    "bert_model": bert_model,
    "do_lower_case": do_lower_case,
    "do_whole_word_mask": do_whole_word_mask,
    "reduce_memory": reduce_memory,
    "num_workers": num_workers,
    "epochs_to_generate": epochs_to_generate,
    "max_seq_len": max_seq_len,
    "short_seq_prob": short_seq_prob,
    "masked_lm_prob": masked_lm_prob,
    "max_predictions_per_seq": max_predictions_per_seq 
}

In [0]:
if num_workers > 1 and reduce_memory:
        raise ValueError("Cannot use multiple workers while reducing memory")

In [0]:
tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=do_lower_case)
vocab_list = list(tokenizer.vocab.keys())

In [0]:
with DocumentDatabase(reduce_memory=reduce_memory) as docs:
    with open(train_corpus, 'r') as f:
        doc = []
        for line in tqdm(f, desc="Loading Dataset", unit=" lines"):
            line = line.strip()
            if line == "":
                # A blank line marks the end of a document
                docs.add_document(doc)
                doc = []
            else:
                tokens = tokenizer.tokenize(line)
                doc.append(tokens)
        if doc:
            docs.add_document(doc)  # If the last doc didn't end on a newline, make sure it still gets added
    if len(docs) <= 1:
        exit("ERROR: No document breaks were found in the input file! These are necessary to allow the script to "
             "ensure that random NextSentences are not sampled from the same document. Please add blank lines to "
             "indicate breaks between documents in your input file. If your dataset does not contain multiple "
             "documents, blank lines can be inserted at any natural boundary, such as the ends of chapters, "
             "sections or paragraphs.")

    if num_workers > 1:
        writer_workers = Pool(min(num_workers, epochs_to_generate))
        arguments = [(docs, vocab_list, args, idx) for idx in range(epochs_to_generate)]
        writer_workers.starmap(create_training_file, arguments)
    else:
        for epoch in trange(epochs_to_generate, desc="Epoch"):
            create_training_file(docs, vocab_list, args, epoch)



# Training on Pregenerated Data

Training used the data pregenerated for 4 epochs. Since both Masked LM and Next Sentence Prediction were used in the project, BertForPreTraining from PyTorch-Transformers, which combines both heads, was a good choice to work with.

In [0]:
from argparse import ArgumentParser
from pathlib import Path
import os
import torch
import logging
import json
import random
import numpy as np
from collections import namedtuple
from tempfile import TemporaryDirectory

from torch.utils.data import DataLoader, Dataset, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME
from pytorch_transformers.modeling_bert import BertForPreTraining
from pytorch_transformers.tokenization_bert import BertTokenizer
from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule

InputFeatures = namedtuple("InputFeatures", "input_ids input_mask segment_ids lm_label_ids is_next")

log_format = '%(asctime)-10s: %(message)s'
logging.basicConfig(level=logging.INFO, format=log_format)

In [0]:
def convert_example_to_features(example, tokenizer, max_seq_length):
    tokens = example["tokens"]
    segment_ids = example["segment_ids"]
    is_random_next = example["is_random_next"]
    masked_lm_positions = example["masked_lm_positions"]
    masked_lm_labels = example["masked_lm_labels"]

    assert len(tokens) == len(segment_ids) <= max_seq_length  # The preprocessed data should be already truncated
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    masked_label_ids = tokenizer.convert_tokens_to_ids(masked_lm_labels)

    # The np.int and np.bool aliases were removed in NumPy 1.24, so explicit dtypes are used
    input_array = np.zeros(max_seq_length, dtype=np.int64)
    input_array[:len(input_ids)] = input_ids

    mask_array = np.zeros(max_seq_length, dtype=bool)
    mask_array[:len(input_ids)] = 1

    segment_array = np.zeros(max_seq_length, dtype=bool)
    segment_array[:len(segment_ids)] = segment_ids

    lm_label_array = np.full(max_seq_length, dtype=np.int64, fill_value=-1)
    lm_label_array[masked_lm_positions] = masked_label_ids

    features = InputFeatures(input_ids=input_array,
                             input_mask=mask_array,
                             segment_ids=segment_array,
                             lm_label_ids=lm_label_array,
                             is_next=is_random_next)
    return features
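
The padding logic in `convert_example_to_features` can be followed on a short example. The sketch below uses plain Python lists instead of NumPy arrays. The helper name `pad_features` is hypothetical and the token ids are illustrative values, not looked up in the real vocabulary.

```python
def pad_features(input_ids, segment_ids, masked_lm_positions, masked_label_ids,
                 max_seq_length):
    """Pad one example to max_seq_length, mirroring the array layout used above."""
    pad = max_seq_length - len(input_ids)
    padded_ids = input_ids + [0] * pad
    input_mask = [1] * len(input_ids) + [0] * pad   # 1 = real token, 0 = padding
    padded_segments = segment_ids + [0] * pad
    lm_label_ids = [-1] * max_seq_length            # -1 = position without an LM label
    for position, label_id in zip(masked_lm_positions, masked_label_ids):
        lm_label_ids[position] = label_id
    return padded_ids, input_mask, padded_segments, lm_label_ids


# Illustrative ids only; position 2 holds the masked token
ids, mask, segments, labels = pad_features(
    input_ids=[101, 11, 103, 102, 13, 102],
    segment_ids=[0, 0, 0, 0, 1, 1],
    masked_lm_positions=[2],
    masked_label_ids=[12],
    max_seq_length=8,
)
print(ids, mask, segments, labels, sep="\n")
```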

In [0]:
class PregeneratedDataset(Dataset):
    def __init__(self, training_path, epoch, tokenizer, num_data_epochs, reduce_memory=False):
        self.vocab = tokenizer.vocab
        self.tokenizer = tokenizer
        self.epoch = epoch
        self.data_epoch = epoch % num_data_epochs
        data_file = training_path + "/" + "epoch_{}.json".format(self.data_epoch)
        metrics_file = training_path + "/" + "epoch_{}_metrics.json".format(self.data_epoch)
        assert os.path.isfile(data_file) and os.path.isfile(metrics_file)
        # assert data_file.is_file() and metrics_file.is_file()
        f = open(metrics_file, 'r')
        metrics = json.loads(f.read())
        f.close()
        num_samples = metrics['num_training_examples']
        seq_len = metrics['max_seq_len']
        self.temp_dir = None
        self.working_dir = None
        if reduce_memory:
            self.temp_dir = TemporaryDirectory()
            self.working_dir = Path(self.temp_dir.name)
            input_ids = np.memmap(filename=self.working_dir/'input_ids.memmap',
                                  mode='w+', dtype=np.int32, shape=(num_samples, seq_len))
            input_masks = np.memmap(filename=self.working_dir/'input_masks.memmap',
                                    shape=(num_samples, seq_len), mode='w+', dtype=bool)
            segment_ids = np.memmap(filename=self.working_dir/'segment_ids.memmap',
                                    shape=(num_samples, seq_len), mode='w+', dtype=bool)
            lm_label_ids = np.memmap(filename=self.working_dir/'lm_label_ids.memmap',
                                     shape=(num_samples, seq_len), mode='w+', dtype=np.int32)
            lm_label_ids[:] = -1
            is_nexts = np.memmap(filename=self.working_dir/'is_nexts.memmap',
                                 shape=(num_samples,), mode='w+', dtype=bool)
        else:
            input_ids = np.zeros(shape=(num_samples, seq_len), dtype=np.int32)
            input_masks = np.zeros(shape=(num_samples, seq_len), dtype=bool)
            segment_ids = np.zeros(shape=(num_samples, seq_len), dtype=bool)
            lm_label_ids = np.full(shape=(num_samples, seq_len), dtype=np.int32, fill_value=-1)
            is_nexts = np.zeros(shape=(num_samples,), dtype=bool)
        print("Loading training examples for epoch {}".format(epoch))
        with open(data_file, 'r') as f:
            for i, line in enumerate(tqdm(f, total=num_samples, desc="Training examples")):
                line = line.strip()
                example = json.loads(line)
                features = convert_example_to_features(example, tokenizer, seq_len)
                input_ids[i] = features.input_ids
                segment_ids[i] = features.segment_ids
                input_masks[i] = features.input_mask
                lm_label_ids[i] = features.lm_label_ids
                is_nexts[i] = features.is_next
        assert i == num_samples - 1  # Assert that the sample count metric was true
        logging.info("Loading complete!")
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.input_ids = input_ids
        self.input_masks = input_masks
        self.segment_ids = segment_ids
        self.lm_label_ids = lm_label_ids
        self.is_nexts = is_nexts

    def __len__(self):
        return self.num_samples

    def __getitem__(self, item):
        return (torch.tensor(self.input_ids[item].astype(np.int64)),
                torch.tensor(self.input_masks[item].astype(np.int64)),
                torch.tensor(self.segment_ids[item].astype(np.int64)),
                torch.tensor(self.lm_label_ids[item].astype(np.int64)),
                torch.tensor(self.is_nexts[item].astype(np.int64)))
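
`PregeneratedDataset` reads one JSON object per line from `epoch_{n}.json` and trusts the `num_training_examples` value stored in `epoch_{n}_metrics.json`. The sketch below checks that the two agree before any arrays are allocated. It is an illustration only: the helper names `count_examples` and `metrics_match` and the `"tokens"` key in the demo data are hypothetical and do not come from the notebook.

```python
import json
import os
import tempfile


def count_examples(data_file):
    """Count the non-empty JSON lines in a pregenerated epoch file."""
    count = 0
    with open(data_file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
    return count


def metrics_match(data_file, metrics_file):
    """Return True if the metrics file reports the real example count."""
    with open(metrics_file, "r", encoding="utf-8") as f:
        metrics = json.load(f)
    return metrics["num_training_examples"] == count_examples(data_file)


# Tiny self-contained demo with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "epoch_0.json")
    metrics_path = os.path.join(tmp, "epoch_0_metrics.json")
    with open(data_path, "w", encoding="utf-8") as f:
        for tokens in (["a", "b"], ["c"], ["d", "e", "f"]):
            # "tokens" is a hypothetical key used only for this demo
            f.write(json.dumps({"tokens": tokens}) + "\n")
    with open(metrics_path, "w", encoding="utf-8") as f:
        json.dump({"num_training_examples": 3, "max_seq_len": 8}, f)
    demo_count = count_examples(data_path)
    demo_ok = metrics_match(data_path, metrics_path)
```

Running such a check first turns a mismatch into an early, readable failure instead of the bare `assert` at the end of `__init__`.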

In [0]:
pregenerated_data = "/content/drive/My Drive/BERT/training"
output_dir = "/content/drive/My Drive/BERT/finetuned"
bert_model = "bert-base-multilingual-cased"
do_lower_case = False
reduce_memory = True
epochs = 4
local_rank = -1 # local_rank for distributed training on gpus
no_cuda = False # Whether not to use CUDA when available
gradient_accumulation_steps = 1 # Number of update steps to accumulate before performing a backward/update pass.
train_batch_size = 2 # Total batch size for training.
fp16 = False # Whether to use 16-bit float precision instead of 32-bit
loss_scale = 0 # Loss scaling to improve fp16 numeric stability. Only used when fp16 is set to True.
               # 0 (default value): dynamic loss scaling.
               # Positive power of 2: static loss scaling value.
warmup_steps = 0.1 # Linear warmup over warmup_steps. The scheduler treats this as a step count, so 0.1 means essentially no warmup.
adam_epsilon = 1e-8 # Epsilon for Adam optimizer.
learning_rate = 5e-5 # The initial learning rate for Adam.
seed = 1003 # random seed for initialization

In [0]:
    assert os.path.isdir(pregenerated_data), \
        "--pregenerated_data should point to the folder of files made by pregenerate_training_data.py!"

In [0]:
    samples_per_epoch = []
    for i in range(epochs):
        epoch_file = pregenerated_data + "/" + "epoch_{}.json".format(i)
        metrics_file = pregenerated_data + "/" + "epoch_{}_metrics.json".format(i)
        if os.path.isfile(epoch_file) and os.path.isfile(metrics_file):
            with open(metrics_file, "r") as f:
                metrics = json.load(f)
            samples_per_epoch.append(metrics['num_training_examples'])
        else:
            if i == 0:
                exit("No training data was found!")
            print(f"Warning! There are fewer epochs of pregenerated data ({i}) than training epochs ({epochs}).")
            print("This script will loop over the available data, but training diversity may be negatively impacted.")
            num_data_epochs = i
            break
    else:
        num_data_epochs = epochs

In [0]:
    if local_rank == -1 or no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not no_cuda else "cpu")
        n_gpu = torch.cuda.device_count()
    else:
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        n_gpu = 1
        # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
        torch.distributed.init_process_group(backend='nccl')
    print("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
        device, n_gpu, bool(local_rank != -1), fp16))

device: cuda n_gpu: 1, distributed training: False, 16-bits training: False


In [0]:
if gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
                            gradient_accumulation_steps))

In [0]:
train_batch_size = train_batch_size // gradient_accumulation_steps

In [0]:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(seed)

In [0]:
# if args.output_dir.is_dir() and list(args.output_dir.iterdir()):
#         logging.warning(f"Output directory ({args.output_dir}) already exists and is not empty!")
#     # args.output_dir.mkdir(parents=True, exist_ok=True)

In [0]:
    tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=do_lower_case)

In [0]:
tokenizer.vocab_size

119547

In [0]:
    total_train_examples = 0
    for i in range(epochs):
        # The modulo takes into account the fact that we may loop over limited epochs of data
        total_train_examples += samples_per_epoch[i % len(samples_per_epoch)]

In [0]:
    num_train_optimization_steps = int(
        total_train_examples / train_batch_size / gradient_accumulation_steps)
    if local_rank != -1:
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
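
The arithmetic in the two cells above can be checked by hand. The following sketch reproduces it with the per-epoch example counts that appear in the loading progress bars further down (7777, 6515, 7529 and 6922). The function names are hypothetical helpers, not part of the notebook.

```python
def total_examples(samples_per_epoch, epochs):
    """Sum the examples seen over all training epochs.

    The modulo reuses the available data epochs when fewer were
    pregenerated than the number of training epochs.
    """
    return sum(samples_per_epoch[i % len(samples_per_epoch)] for i in range(epochs))


def optimization_steps(total, batch_size, accumulation_steps, world_size=1):
    """Number of optimizer updates for the whole run."""
    steps = int(total / batch_size / accumulation_steps)
    return steps // world_size


samples = [7777, 6515, 7529, 6922]
total = total_examples(samples, epochs=4)
steps = optimization_steps(total, batch_size=2, accumulation_steps=1)
print(total, steps)  # 28743 14371

# If only the first two data epochs existed, they would be reused:
looped_total = total_examples(samples[:2], epochs=4)
print(looped_total)  # 28584
```

These numbers match the `Num examples` and `Num steps` values printed before training starts.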

In [0]:
    model = BertForPreTraining.from_pretrained(bert_model)



In [0]:
    # Prepare model
    if fp16:
        model.half()
    model.to(device)

BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIn

In [0]:
    # Prepare optimizer
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
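
The grouping above works purely on parameter names: any parameter whose name contains one of the `no_decay` substrings is excluded from weight decay. A minimal sketch of that filter follows, using a handful of illustrative names modeled on the module structure printed earlier. The list is a hand-written sample, not the full output of `model.named_parameters()`, and `split_by_weight_decay` is a hypothetical helper.

```python
NO_DECAY = ("bias", "LayerNorm.bias", "LayerNorm.weight")


def split_by_weight_decay(names, no_decay=NO_DECAY):
    """Split parameter names into (decayed, not_decayed) by substring match."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# Illustrative sample of parameter names (assumption, not real output).
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
]

decayed, not_decayed = split_by_weight_decay(names)
for name in decayed:
    print("weight_decay=0.01:", name)
for name in not_decayed:
    print("weight_decay=0.0: ", name)
```

In this sample only the embedding matrix and the query weight receive weight decay; every bias and both LayerNorm parameters land in the second group.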

In [0]:
    if fp16:
        try:
            from apex.optimizers import FP16_Optimizer
            from apex.optimizers import FusedAdam
        except ImportError:
            raise ImportError(
                "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")

        optimizer = FusedAdam(optimizer_grouped_parameters,
                              lr=learning_rate,
                              bias_correction=False,
                              max_grad_norm=1.0)
        if loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer, static_loss_scale=loss_scale)
    else:
        optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=num_train_optimization_steps)
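
To make the effect of `warmup_steps` concrete, here is a minimal sketch of a linear warmup followed by linear decay. It assumes the schedule multiplies the base learning rate by a factor that rises from 0 to 1 over `warmup_steps` and then falls back to 0 at `t_total`. It is a re-implementation for illustration, not the library's code, and `linear_warmup_multiplier` is a hypothetical name.

```python
def linear_warmup_multiplier(step, warmup_steps, t_total):
    """Learning-rate multiplier at a given optimizer step (sketch)."""
    if step < warmup_steps:
        return float(step) / float(max(1, warmup_steps))
    return max(0.0, float(t_total - step) / float(max(1.0, t_total - warmup_steps)))


t_total = 14371  # number of optimization steps reported for this run

# warmup_steps = 0.1, as configured in this notebook
print(linear_warmup_multiplier(0, 0.1, t_total))
print(linear_warmup_multiplier(1, 0.1, t_total))

# A 10% warmup expressed as a step count instead
ten_percent = int(0.1 * t_total)
print(ten_percent)
print(linear_warmup_multiplier(ten_percent // 2, ten_percent, t_total))
```

Under this assumption, a value of 0.1 is read as a fraction of one step, so only step 0 is affected and the schedule is effectively a plain linear decay. If a 10% warmup was intended, the step count would have to be passed instead.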

As the output below shows, 28743 training examples were generated across the four pregenerated epochs. That is not much data, but the model was still able to improve on it.
The loss after 4 epochs was 1.66.

In [0]:
    print("***** Running training *****")
    print(f"  Num examples = {total_train_examples}")
    print(f"  Batch size = {train_batch_size}")
    print(f"  Num steps = {num_train_optimization_steps}")

***** Running training *****
  Num examples = 28743
  Batch size = 2
  Num steps = 14371


In [0]:
    global_step = 0
    model.train()
    for epoch in range(epochs):
        epoch_dataset = PregeneratedDataset(epoch=epoch, training_path=pregenerated_data, tokenizer=tokenizer,
                                            num_data_epochs=num_data_epochs, reduce_memory=reduce_memory)
        if local_rank == -1:
            train_sampler = RandomSampler(epoch_dataset)
        else:
            train_sampler = DistributedSampler(epoch_dataset)
        train_dataloader = DataLoader(epoch_dataset, sampler=train_sampler, batch_size=train_batch_size)
        tr_loss = 0
        nb_tr_examples, nb_tr_steps = 0, 0
        with tqdm(total=len(train_dataloader), desc=f"Epoch {epoch}") as pbar:
            for step, batch in enumerate(train_dataloader):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, lm_label_ids, is_next = batch
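                # Keyword arguments avoid depending on the positional order of forward()
                outputs = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask,
                                masked_lm_labels=lm_label_ids, next_sentence_label=is_next)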
                loss = outputs[0]
                if n_gpu > 1:
                    loss = loss.mean() # mean() to average on multi-gpu.
                if gradient_accumulation_steps > 1:
                    loss = loss / gradient_accumulation_steps
                if fp16:
                    optimizer.backward(loss)
                else:
                    loss.backward()
                # print("Loss: ", loss.item())
                tr_loss += loss.item()
                nb_tr_examples += input_ids.size(0)
                nb_tr_steps += 1
                pbar.update(1)
                mean_loss = tr_loss * gradient_accumulation_steps / nb_tr_steps
                pbar.set_postfix_str(f"Loss: {mean_loss:.5f}")
                if (step + 1) % gradient_accumulation_steps == 0:
                    optimizer.step()
                    scheduler.step()  # Update learning rate schedule (after the optimizer step)
                    optimizer.zero_grad()
                    global_step += 1



Loading training examples for epoch 0 (7777 examples)
Loading training examples for epoch 1 (6515 examples)
Loading training examples for epoch 2 (7529 examples)
Loading training examples for epoch 3 (6922 examples)
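
The training loop only updates the weights when `(step + 1) % gradient_accumulation_steps == 0`. The sketch below counts how many updates that condition produces for a given number of batches. `count_updates` is a hypothetical helper, and the batch counts are made-up numbers chosen for illustration.

```python
def count_updates(num_batches, accumulation_steps):
    """Count optimizer updates triggered within one pass over the batches."""
    updates = 0
    for step in range(num_batches):
        if (step + 1) % accumulation_steps == 0:
            updates += 1
    return updates


print(count_updates(10, 1))  # 10: every batch triggers an update
print(count_updates(10, 4))  # 2: updates after batches 4 and 8
```

With `gradient_accumulation_steps = 1`, as used in this notebook, every batch triggers an update. With larger values, the trailing batches that do not fill a complete group do not trigger an update of their own within that pass.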

In [0]:
    # Save a trained model
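    # Save only once: in non-distributed runs, or on the rank-0 process otherwise
    if local_rank == -1 or torch.distributed.get_rank() == 0:
        print("** ** * Saving fine-tuned model ** ** * ")
        os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
        model.save_pretrained(output_dir)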
        tokenizer.save_pretrained(output_dir)

** ** * Saving fine-tuned model ** ** * 


In [0]:
trained_model = model
trained_tokenizer = tokenizer