# Punctuation restoration

Sequence tagging with BERT-like models is quite common among researchers nowadays.
+ For example, the paper at https://www.researchgate.net/publication/348618580_Automatic_punctuation_restoration_with_BERT_models (WordPiece + BERT uncased) suggests tagging each word with a label such as `EMPTY`, `PERIOD`, `COMMA`, etc., depending on whether the word is followed by a punctuation symbol and, if so, which one. A minor downside of this approach is that it uses a full-fledged dictionary, which may be memory-consuming.
+ An overlapping sliding window is suggested for sentences that exceed the input length. Because of the overlap, the same piece may receive several predictions; their mean is taken.
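
The averaging step can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: `window_spans` and `average_overlaps` are hypothetical helper names, and each window is assumed to return one comma probability per token.

```python
def window_spans(n_tokens, window, stride):
    """Return (start, end) spans that cover n_tokens; spans overlap when stride < window."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def average_overlaps(n_tokens, spans, window_probs):
    """Average the per-token probabilities predicted by overlapping windows."""
    sums = [0.0] * n_tokens
    counts = [0] * n_tokens
    for (start, end), probs in zip(spans, window_probs):
        for offset, prob in enumerate(probs[: end - start]):
            sums[start + offset] += prob
            counts[start + offset] += 1
    # every position is covered by at least one span, so counts are never zero
    return [s / c for s, c in zip(sums, counts)]


spans = window_spans(n_tokens=5, window=4, stride=2)
# Assumption: made-up probabilities standing in for model output, one list per window
window_probs = [[0.2, 0.4, 0.6, 0.8], [0.0, 0.0, 1.0]]
averaged = average_overlaps(5, spans, window_probs)
```

Tokens covered by a single window keep their prediction unchanged; only the overlapping positions are averaged.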

First of all, in my research I'd like to concentrate on correcting commas in a sentence, so I can limit the output labels to "COMMA" and "EMP" for now.

1. To train the model from scratch, we have to convert the BPE-encoded data into a form in which the objective is to predict whether a token is followed by a comma/period.

In [5]:
import youtokentome as yttm

In [1]:
text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique lacus eu massa luctus, mollis bibendum dolor imperdiet. Donec venenatis interdum sodales. Vestibulum et risus quis urna imperdiet egestas a eget lacus. Sed fringilla varius aliquam. Sed egestas ligula nec sodales pretium. Curabitur dapibus eget nisi sit amet efficitur. Etiam mauris orci, finibus vel quam nec, laoreet viverra est. Pellentesque quis enim ut purus pretium condimentum ut quis sem.

Vestibulum dolor diam, efficitur vel suscipit vel, viverra at massa. Donec suscipit lectus eget ligula aliquet, ut pulvinar dolor pharetra. Donec et justo ex. Aliquam non auctor elit, quis elementum tellus. Pellentesque a odio egestas leo facilisis luctus ut eu libero. Praesent eu tincidunt orci, quis gravida erat. Aenean consectetur orci eros, id porta felis tincidunt eget. Fusce sed lectus dictum, fermentum tellus et, sagittis nunc. Nulla facilisi.

In venenatis venenatis consequat. Etiam tellus diam, maximus et dapibus sollicitudin, euismod id magna. Nam at iaculis arcu, a tempor orci. Maecenas dictum maximus diam id finibus. Donec in feugiat dui. Nam scelerisque, risus vel pharetra condimentum, sem nisi eleifend augue, rutrum dapibus lorem ex eget quam. Duis gravida mollis lectus non bibendum. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae;

Maecenas ultrices id nisl varius dapibus. Donec pellentesque turpis sit amet leo maximus, vel ornare nisl laoreet. Donec venenatis cursus velit, in sagittis leo rutrum et. Vestibulum nec est ac arcu porttitor convallis quis et metus. Morbi urna lectus, interdum vitae purus a, pulvinar congue orci. Vivamus fermentum non justo eu hendrerit. Vestibulum congue nunc eget felis scelerisque tempor.

Nunc fringilla malesuada ullamcorper. Maecenas mollis vitae mi vitae sagittis. Duis eleifend nisi in dapibus maximus. Praesent varius gravida tincidunt. Donec lacinia interdum neque maximus hendrerit. Sed ornare tellus sed ex facilisis tincidunt. Sed luctus, neque at semper luctus, nibh orci viverra justo, vel suscipit est arcu non justo. In posuere volutpat nulla, vitae dignissim purus sagittis quis. Sed blandit ante varius neque lacinia dictum. Vestibulum quis mollis elit, vitae feugiat elit. """

1. First of all, I am going to test whether yttm behaves nicely on text in which commas are separated by a space on both sides. This approach loses some information, such as common word endings that are followed by a comma, yet it may be simpler to implement.
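
As a rough illustration of this preprocessing, commas could be spaced out with a regular expression. `space_out_commas` is a hypothetical helper, not something the notebook itself uses, and the choice to leave newlines untouched is an assumption.

```python
import re


def space_out_commas(text):
    """Surround every comma with single spaces so that it becomes a separate word."""
    # [ \t]* matches only spaces and tabs, so line breaks around a comma are preserved
    return re.sub(r"[ \t]*,[ \t]*", " , ", text)


spaced = space_out_commas("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
```

Unlike a plain string replacement, this normalizes commas that already have a space before them or none after them, at the cost of leaving a trailing space when a comma ends the text.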

In [7]:
total_words=3000000
bpe_model_path = "bpe.yttm"
def create_bpe_tokenizer_from_scratch(corpus, train_data_path="yttm_train_data.txt"):
    # yttm trains from a file, so dump the corpus (a single string) to disk first
    with open(train_data_path, "w") as _file:
        _file.write(corpus)
    # Training model; train() saves it to bpe_model_path and returns the BPE object
    # (data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
    return yttm.BPE.train(data=train_data_path, vocab_size=total_words, model=bpe_model_path)

In [12]:
bpe = create_bpe_tokenizer_from_scratch(text.replace(",", " ,"))

In [11]:
print(' '.join(bpe.vocab())[:300])

<PAD> <UNK> <BOS> <EOS> ▁ e i u s t a n l r m c o d . v p g , b q f D x V S h j P N M I E A L F C ; us is ▁e en it or um ▁v qu ▁l ci et ▁a ▁s er es ▁n ec on in el ib ▁d ul ▁m ur ▁p ▁f am im ▁t ra ▁c ar ▁ma ▁qu ent ol tus que est ▁or ▁eg ▁D at ▁quis ▁con lis ed ic em ▁Don ▁orci ▁Donec ▁i ibus ell ▁ve


In [20]:
test = "And , que est , at ."
enc = bpe.encode(test)
enc

[374, 11, 17, 47, 78, 5, 252, 47, 230, 4, 18]

In [27]:
COMMA = bpe.encode(",")[0]

In [22]:
[bpe.decode([symb])[0] for symb in enc]

['A', 'n', 'd', ',', 'qu', 'e', 'est', ',', 'at', '', '.']

### Algorithm plan
1. Encode the incoming text
    - to ensure that word splitting won't mess things up
2. Strip the commas from the encoded sentence
3. Generate the target mask of `[EMP]` and `[COMMA]` labels
    - its length must equal that of the encoded sentence without commas

In [37]:
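def encode_commas(sent):
    # for-loopish version can be optimized
    # one label per non-comma token: "[COMMA]" if the next token is a comma, else "[EMP]"
    result = []
    for idx, token in enumerate(sent):
        if token == COMMA:
            # commas are stripped from the input, so they get no label (even in last position)
            continue
        if idx + 1 < len(sent) and sent[idx + 1] == COMMA:
            result.append("[COMMA]")
        else:
            # also covers the last token, which has no successor to check
            result.append("[EMP]")
    return result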

In [40]:
assert len(encode_commas(enc)) == len(list(filter(lambda x: x!=COMMA, enc)))

In [39]:
encode_commas(enc)

['[EMP]',
 '[EMP]',
 '[COMMA]',
 '[EMP]',
 '[EMP]',
 '[COMMA]',
 '[EMP]',
 '[EMP]',
 '[EMP]']
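
To check that the labels carry enough information, the commas can be stripped and then re-inserted from the mask. The sketch below is an illustration under assumptions: `strip_commas` and `restore_commas` are hypothetical helpers, and the comma id is hard-coded to 47 because that is the id that decodes to `,` in the output above (in the notebook it would come from `COMMA`).

```python
COMMA_ID = 47  # assumption: the id that decodes to ',' in the example encoding above


def strip_commas(token_ids, comma_id=COMMA_ID):
    """Step 2 of the plan: drop comma tokens from the encoded sentence."""
    return [t for t in token_ids if t != comma_id]


def restore_commas(stripped_ids, labels, comma_id=COMMA_ID):
    """Inverse step: re-insert a comma after every token labelled "[COMMA]"."""
    restored = []
    for token, label in zip(stripped_ids, labels):
        restored.append(token)
        if label == "[COMMA]":
            restored.append(comma_id)
    return restored


# Encoded sentence and labels copied from the cell outputs above
enc = [374, 11, 17, 47, 78, 5, 252, 47, 230, 4, 18]
labels = ["[EMP]", "[EMP]", "[COMMA]", "[EMP]", "[EMP]",
          "[COMMA]", "[EMP]", "[EMP]", "[EMP]"]

stripped = strip_commas(enc)
round_trip = restore_commas(stripped, labels)
```

The round trip is lossless for this example, but not in general: a comma at the very start of a sentence or several commas in a row cannot be represented by one label per token.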