<a href="https://colab.research.google.com/github/zaaabik/hse/blob/master/application_dl/nlp_3/Language_modeling_seminar_solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### N-gram language models or how to write scientific papers

We shall train our language model on a corpus of [ArXiv](http://arxiv.org/) articles and see if we can generate a new one!

![img](https://media.npr.org/assets/img/2013/12/10/istock-18586699-monkey-computer_brick-16e5064d3378a14e0e4c2da08857efe03c04695e-s800-c85.jpg)

_data by neelshah18 from [here](https://www.kaggle.com/neelshah18/arxivdataset/)_

_Disclaimer: this has nothing to do with actual science. But it's fun, so who cares?!_

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
%matplotlib inline

In [None]:
# Alternative manual download link: https://yadi.sk/d/_nGyU2IajjR9-w
!wget "https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1" -O arxivData.json.tar.gz
!tar -xvzf arxivData.json.tar.gz
data = pd.read_json("./arxivData.json")
data.sample(n=5)

--2023-03-12 20:21:16--  https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:601a:18::a27d:712
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz [following]
--2023-03-12 20:21:16--  https://www.dropbox.com/s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc5f23d84df0b21d7ca79922a72e.dl.dropboxusercontent.com/cd/0/get/B4H-cIahBMEAbKIX26GdGmjecU7KofluuFsYHCwK5dxZ8cdp8-Sd-2hKyl-YkjGZh_ECyocvmsYjCBNOoM57j3muTbmqVFHGB5ayTcqkdF-oxJ_x0b9msSaRxTIUg1jPmCLyFLl_4VIHoWR35GDeQPW7OWiyldPizDpy7t9gW948iQ/file?dl=1# [following]
--2023-03-12 20:21:17--  https://uc5f23d84df0b21d7ca79922a72e.dl.dropboxusercontent.com/cd/0/get/B4H-cIahBMEAbKIX26GdGmjecU7KofluuFsYHCwK5dxZ8cdp8

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
25224,"[{'name': 'Holger R. Roth'}, {'name': 'Le Lu'}...",22,1506.06448v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",6,Automatic organ segmentation is an important y...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",DeepOrgan: Multi-level Deep Convolutional Netw...,2015
1851,"[{'name': 'Markus Kliegl'}, {'name': 'Siddhart...",25,1710.09026v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",10,We propose and evaluate new techniques for com...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Trace norm regularization and faster inference...,2017
2260,"[{'name': 'An Bian'}, {'name': 'Kfir Y. Levy'}...",4,1711.02515v3,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",11,DR-submodular continuous functions are importa...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Continuous DR-submodular Maximization: Structu...,2017
7047,"[{'name': 'Luming Tang'}, {'name': 'Yexiang Xu...",17,1709.05612v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,Multi-Entity Dependence Learning (MEDL) explor...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Multi-Entity Dependence Learning with Rich Con...,2017
12403,"[{'name': 'Zhinus Marzi'}, {'name': 'Soorya Go...",15,1801.04695v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",1,Deep neural networks represent the state of th...,"[{'term': 'stat.ML', 'scheme': 'http://arxiv.o...",Sparsity-based Defense against Adversarial Att...,2018


In [None]:
# assemble lines: concatenate title and description
lines = data.apply(lambda row: row['title'] + ' ; ' + row['summary'], axis=1).tolist()

sorted(lines, key=len)[:3]

['Differential Contrastive Divergence ; This paper has been retracted.',
 'What Does Artificial Life Tell Us About Death? ; Short philosophical essay',
 'P=NP ; We claim to resolve the P=?NP problem via a formal argument for P=NP.']

### Tokenization

You know the drill. The data is messy. Go clean the data. Use WordPunctTokenizer or something.


In [None]:
# Task: convert lines (in-place) into strings of space-separated tokens. import & use WordPunctTokenizer
from nltk import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
lines = [
    ' '.join(
        tokenizer.tokenize(line.lower())
    ) for line in tqdm(lines)
]

  0%|          | 0/41000 [00:00<?, ?it/s]

In [None]:
assert sorted(lines, key=len)[0] == \
    'differential contrastive divergence ; this paper has been retracted .'
assert sorted(lines, key=len)[2] == \
    'p = np ; we claim to resolve the p =? np problem via a formal argument for p = np .'

### N-Gram Language Model

A language model is a probabilistic model that estimates text probability: the joint probability of all tokens $w_t$ in text $X$: $P(X) = P(w_1, \dots, w_T)$.

It can do so by following the chain rule:
$$P(w_1, \dots, w_T) = P(w_1)P(w_2 \mid w_1)\dots P(w_T \mid w_1, \dots, w_{T-1}).$$ 

The problem with this approach is that the final term $P(w_T \mid w_1, \dots, w_{T-1})$ depends on all $T-1$ previous words. Such a probability is impractical to estimate for long texts, e.g. $T = 1000$.

One popular approximation is to assume that the next word depends only on a fixed number of previous words:

$$P(w_t \mid w_1, \dots, w_{t - 1}) = P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1})$$

Such a model is called an __n-gram language model__, where n is a parameter. For example, in a 3-gram language model, each word depends only on the 2 previous words.

$$
    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1}).
$$

You can also sometimes see this approximation called an _(n-1)-th order Markov assumption_, since each word depends on the $n - 1$ words before it.
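
As a worked illustration with generic tokens $w_1, \dots, w_5$ (not taken from the dataset), here is how the chain rule truncates for $n = 3$. Every factor keeps at most the two preceding words; the first two factors simply have shorter contexts, and how those are represented (for example by padding) is a convention.

$$
P(w_1, \dots, w_5) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\,P(w_4 \mid w_2, w_3)\,P(w_5 \mid w_3, w_4)
$$

Compared with the full chain rule, only the last two factors change: $P(w_4 \mid w_1, w_2, w_3)$ becomes $P(w_4 \mid w_2, w_3)$ and $P(w_5 \mid w_1, \dots, w_4)$ becomes $P(w_5 \mid w_3, w_4)$.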

The first stage in building such a model is counting how often each word occurs after each sequence of N-1 previous words.

In [None]:
from tqdm import tqdm
from collections import defaultdict, Counter

# special tokens: 
# - unk represents absent tokens, 
# - eos is a special token after the end of sequence

UNK, EOS = "_UNK_", "_EOS_"

def count_ngrams(lines, n):
    """
    Count how many times each word occurred after (n - 1) previous words
    :param lines: an iterable of strings with space-separated tokens
    :returns: a dictionary { tuple(prefix_tokens): {next_token_1: count_1, next_token_2: count_2}}

    When building counts, please consider the following two edge cases
    - if prefix is shorter than (n - 1) tokens, it should be padded with UNK. For n=3,
      empty prefix: "" -> (UNK, UNK)
      short prefix: "the" -> (UNK, the)
      long prefix: "the new approach" -> (new, approach)
    - you should add a special token, EOS, at the end of each sequence
      "... with deep neural networks ." -> (..., with, deep, neural, networks, ., EOS)
      count the probability of this token just like all others.
    """
    counts = defaultdict(Counter)
    # counts[(word1, word2)][word3] = how many times word3 occurred after (word1, word2)

    # pad the start with (n - 1) UNK tokens and append EOS at the end
    padding = [UNK] * (n - 1)
    for line in tqdm(lines, desc='N-grams'):
        tokens = padding + line.split() + [EOS]
        for i in range(n - 1, len(tokens)):
            prefix = tuple(tokens[i - n + 1: i])
            counts[prefix][tokens[i]] += 1

    return counts


In [None]:
# let's test it
dummy_lines = sorted(lines, key=len)[:100]
dummy_counts = count_ngrams(dummy_lines, n=3)
assert set(map(len, dummy_counts.keys())) == {2}, "please only count {n-1}-grams"
assert len(dummy_counts[('_UNK_', '_UNK_')]) == 78
assert dummy_counts['_UNK_', 'a']['note'] == 3
assert dummy_counts['p', '=']['np'] == 2
assert dummy_counts['author', '.']['_EOS_'] == 1

N-grams: 100%|██████████| 100/100 [00:00<00:00, 5401.41it/s]


In [None]:
dummy_counts[('_UNK_', 'a')]

Counter({'notation': 1,
         'note': 3,
         'comment': 1,
         'machine': 2,
         'theory': 1,
         'survey': 1,
         'history': 1,
         'new': 1,
         'remark': 1,
         'primer': 1})

Once we can count N-grams, we can build a probabilistic language model.
The simplest way to compute probabilities is in proportion to counts:

$$ P(w_t | prefix) = { Count(prefix, w_t) \over \sum_{\hat w} Count(prefix, \hat w) } $$
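
The formula above can be checked on a toy example. This is only a sketch: the counts below are made up for illustration and do not come from the ArXiv data.

```python
from collections import Counter

# Hypothetical counts of tokens observed after one fixed prefix.
next_token_counts = Counter({"learning": 6, "networks": 3, "models": 1})

# Denominator of the formula: total number of times the prefix was followed by anything.
total = sum(next_token_counts.values())

# P(token | prefix) = Count(prefix, token) / total
probs = {token: count / total for token, count in next_token_counts.items()}

print(probs)  # {'learning': 0.6, 'networks': 0.3, 'models': 0.1}
```

Tokens that never followed the prefix are absent from the dictionary, which corresponds to a probability of exactly zero.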

In [None]:
class NGramLanguageModel:    
    def __init__(self, lines, n):
        """ 
        Train a simple count-based language model: 
        compute probabilities P(w_t | prefix) given ngram counts
        
        :param n: computes probability of next token given (n - 1) previous words
        :param lines: an iterable of strings with space-separated tokens
        """
        assert n >= 1
        self.n = n
    
        counts = count_ngrams(lines, self.n)
        
        # compute token probabilities given counts
        self.probs = defaultdict(Counter)
        # probs[(word1, word2)][word3] = P(word3 | word1, word2)
        
        # populate self.probs with actual probabilities
        for prefix, token_counts in tqdm(counts.items()):
            total = sum(token_counts.values())
            for token, count in token_counts.items():
                self.probs[prefix][token] = count / total
            
    def get_possible_next_tokens(self, prefix):
        """
        :param prefix: string with space-separated prefix tokens
        :returns: a dictionary {token : its probability} for all tokens with positive probabilities
        """
        prefix = prefix.split()
        prefix = prefix[max(0, len(prefix) - self.n + 1):]
        prefix = [ UNK ] * (self.n - 1 - len(prefix)) + prefix
        return self.probs[tuple(prefix)]
    
    def get_next_token_prob(self, prefix, next_token):
        """
        :param prefix: string with space-separated prefix tokens
        :param next_token: the next token to predict probability for
        :returns: P(next_token|prefix) a single number, 0 <= P <= 1
        """
        return self.get_possible_next_tokens(prefix).get(next_token, 0)

Let's test it!

In [None]:
dummy_lm = NGramLanguageModel(dummy_lines, n=3)

p_initial = dummy_lm.get_possible_next_tokens('') # '' -> ['_UNK_', '_UNK_']
assert np.allclose(p_initial['learning'], 0.02)
assert np.allclose(p_initial['a'], 0.13)
assert np.allclose(p_initial.get('meow', 0), 0)
assert np.allclose(sum(p_initial.values()), 1)

p_a = dummy_lm.get_possible_next_tokens('a') # 'a' -> ['_UNK_', 'a']
assert np.allclose(p_a['machine'], 0.15384615)
assert np.allclose(p_a['note'], 0.23076923)
assert np.allclose(p_a.get('the', 0), 0)
assert np.allclose(sum(p_a.values()), 1)

assert np.allclose(dummy_lm.get_possible_next_tokens('a note')['on'], 1)
assert dummy_lm.get_possible_next_tokens('a machine') == \
    dummy_lm.get_possible_next_tokens("there have always been ghosts in a machine"), \
    "your 3-gram model should only depend on 2 previous words"

N-grams: 100%|██████████| 100/100 [00:00<00:00, 10115.53it/s]
100%|██████████| 2086/2086 [00:00<00:00, 269267.79it/s]


Now that you've got a working n-gram language model, let's see what sequences it can generate. But first, let's train it on the whole dataset.

In [None]:
lm = NGramLanguageModel(lines, n=3)

N-grams: 100%|██████████| 41000/41000 [00:19<00:00, 2054.51it/s]
100%|██████████| 1219478/1219478 [00:06<00:00, 184895.01it/s]


The process of generating sequences is... well, it's sequential. You maintain a list of tokens and iteratively add next token by sampling with probabilities.

$ X = [] $

__forever:__
* $w_{next} \sim P(w_{next} | X)$
* $X = concat(X, w_{next})$


Instead of sampling with probabilities, one can also try always taking most likely token, sampling among top-K most likely tokens or sampling with temperature. In the latter case (temperature), one samples from

$$w_{next} \sim {P(w_{next} | X) ^ {1 / \tau} \over \sum_{\hat w} P(\hat w | X) ^ {1 / \tau}}$$

where $\tau > 0$ is the model temperature. If $\tau \ll 1$, more likely tokens will be sampled with even higher probability while less likely tokens will vanish.
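
To make the effect of temperature and top-K concrete, here is a minimal sketch on a made-up distribution. The helper names `apply_temperature` and `keep_top_k` and the probabilities are assumptions introduced for illustration; they are not part of the seminar code.

```python
def apply_temperature(probs, temperature):
    """Rescale a {token: prob} dict to p ** (1 / temperature), renormalized."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    norm = sum(scaled.values())
    return {token: s / norm for token, s in scaled.items()}


def keep_top_k(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    norm = sum(p for _, p in top)
    return {token: p / norm for token, p in top}


# Hypothetical next-token distribution, for illustration only.
probs = {"been": 0.7, "not": 0.2, "lately": 0.1}

sharp = apply_temperature(probs, 0.5)  # tau < 1: mass moves towards "been"
flat = apply_temperature(probs, 2.0)   # tau > 1: closer to uniform
top2 = keep_top_k(probs, 2)            # "lately" is dropped before sampling

for name, dist in [("tau=0.5", sharp), ("tau=2.0", flat), ("top-2", top2)]:
    print(name, {token: round(p, 3) for token, p in dist.items()})
```

Sampling then proceeds from the rescaled distribution exactly as it would from the original one.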

In [None]:
def get_next_token(lm, prefix, temperature=1.0):
    """
    return next token after prefix;
    :param temperature: samples proportionally to lm probabilities ^ (1 / temperature)
        if temperature == 0, always takes most likely token. Break ties arbitrarily.
    """
    next_tokens = lm.get_possible_next_tokens(prefix)
    if temperature == 0:
        # greedy decoding: pick the most likely token (the first one wins on ties)
        next_token = max(next_tokens, key=next_tokens.get)
    else:
        sum_probs = sum([
            prob ** (1 / temperature) for prob in next_tokens.values()
        ])

        next_tokens = {
            token: prob ** (1 / temperature) / sum_probs
            for token, prob in next_tokens.items()
        }
        tokens = list(next_tokens.keys())
        probs = list(next_tokens.values())
        next_token = np.random.choice(tokens, 1, p=probs)[0]
    return next_token

In [None]:
from collections import Counter
test_freqs = Counter([get_next_token(lm, 'there have') for _ in range(10000)])
assert 250 < test_freqs['not'] < 450
assert 8500 < test_freqs['been'] < 9500
assert 1 < test_freqs['lately'] < 200

test_freqs = Counter([get_next_token(lm, 'deep', temperature=1.0) for _ in range(10000)])
assert 1500 < test_freqs['learning'] < 3000
test_freqs = Counter([get_next_token(lm, 'deep', temperature=0.5) for _ in range(10000)])
assert 8000 < test_freqs['learning'] < 9000
test_freqs = Counter([get_next_token(lm, 'deep', temperature=0.0) for _ in range(10000)])
assert test_freqs['learning'] == 10000

print("Looks nice!")

Looks nice!


Let's have fun with this model

In [None]:
prefix = 'artificial' # <- your ideas :)

for i in range(100):
    prefix += ' ' + get_next_token(lm, prefix)
    if prefix.endswith(EOS) or len(lm.get_possible_next_tokens(prefix)) == 0:
        break
        
print(prefix)

artificial brain based on their variance ( i . e .,} segments ) from images and their community membership for the research results , allowing it to create word - vector product evaluations . _EOS_


In [None]:
prefix = 'bridging the' # <- more of your ideas

for i in range(100):
    prefix += ' ' + get_next_token(lm, prefix, temperature=0.5)
    if prefix.endswith(EOS) or len(lm.get_possible_next_tokens(prefix)) == 0:
        break
        
print(prefix)

bridging the gap between the source and target domains . _EOS_


### Evaluating language models: perplexity

Perplexity is a measure of how well your model approximates the true probability distribution behind the data. __Smaller perplexity = better model__.

To compute perplexity on one sentence, use:
$$
    {\mathbb{P}}(w_1 \dots w_N) = P(w_1, \dots, w_N)^{-\frac1N} = \left( \prod_t P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1})\right)^{-\frac1N}.
$$


On the corpus level, perplexity is the product of the probabilities of all tokens in all sentences, raised to the power of $-1/N$, where $N$ is the __total length of all sentences__ in the corpus.

This product can quickly get too small for float32/float64 precision, so we recommend first computing the log-perplexity (from log-probabilities) and then taking the exponent.
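
The underflow issue is easy to reproduce. The sketch below uses made-up per-token probabilities (2000 tokens with probability 0.001 each) rather than a real model, and compares the naive product with the log-space computation.

```python
import math

# Hypothetical per-token probabilities: 2000 tokens, each with P = 0.001.
token_probs = [1e-3] * 2000

# The naive product underflows to exactly 0.0 in float64.
naive_product = 1.0
for p in token_probs:
    naive_product *= p

# Summing log-probabilities stays well within floating-point range.
log_prob_sum = sum(math.log(p) for p in token_probs)
perplexity_value = math.exp(-log_prob_sum / len(token_probs))

print(naive_product)     # 0.0
print(perplexity_value)  # approximately 1000
```

With a product of zero, raising to the power $-1/N$ would fail or return infinity, while the log-space version recovers the expected value of about 1000.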

In [None]:
def perplexity(lm, lines, min_logprob=np.log(10 ** -7.)):
    """
    :param lines: a list of strings with space-separated tokens
    :param min_logprob: if log(P(w | ...)) is smaller than min_logprob, set it equal to min_logprob
    :returns: corpus-level perplexity - a single scalar number from the formula above
    
    Note: do not forget to compute P(w_first | empty) and P(eos | full_sequence)
    
    PLEASE USE lm.get_next_token_prob and NOT lm.get_possible_next_tokens
    """
    total_length = 0
    log_pp = 0

    for line in tqdm(lines):
        tokens = [''] + line.split(' ') + [EOS]

        for t in range(1, len(tokens)):
            prefix = ' '.join(tokens[:t])
            log_pp += max(
                min_logprob, np.log(lm.get_next_token_prob(prefix, tokens[t]))
            )
            total_length += 1
    
    return np.exp(-( 1 / total_length) * log_pp)

In [None]:
lm1 = NGramLanguageModel(dummy_lines, n=1)
lm3 = NGramLanguageModel(dummy_lines, n=3)
lm10 = NGramLanguageModel(dummy_lines, n=10)

ppx1 = perplexity(lm1, dummy_lines)
ppx3 = perplexity(lm3, dummy_lines)
ppx10 = perplexity(lm10, dummy_lines)
ppx_missing = perplexity(lm3, ['the jabberwock , with eyes of flame , '])  # thanks, L. Carroll

print("Perplexities: ppx1=%.3f ppx3=%.3f ppx10=%.3f" % (ppx1, ppx3, ppx10))

assert all(0 < ppx < 500 for ppx in (ppx1, ppx3, ppx10)), "perplexity should be nonnegative and reasonably small"
assert ppx1 > ppx3 > ppx10, "higher N models should overfit and give lower perplexity on their training data"
assert np.isfinite(ppx_missing) and ppx_missing > 10 ** 6, "missing words should have large but finite perplexity. " \
    " Make sure you use min_logprob right"
assert np.allclose([ppx1, ppx3, ppx10], (318.2132342216302, 1.5199996213739575, 1.1838145037901249))

N-grams: 100%|██████████| 100/100 [00:00<00:00, 35078.23it/s]
100%|██████████| 1/1 [00:00<00:00, 1431.01it/s]
N-grams: 100%|██████████| 100/100 [00:00<00:00, 20339.96it/s]
100%|██████████| 2086/2086 [00:00<00:00, 408179.06it/s]
N-grams: 100%|██████████| 100/100 [00:00<00:00, 10437.75it/s]
100%|██████████| 2703/2703 [00:00<00:00, 451987.55it/s]
100%|██████████| 100/100 [00:00<00:00, 9867.56it/s]
100%|██████████| 100/100 [00:00<00:00, 7024.34it/s]
100%|██████████| 100/100 [00:00<00:00, 7639.48it/s]
  min_logprob, np.log(lm.get_next_token_prob(prefix, tokens[t]))
100%|██████████| 1/1 [00:00<00:00, 2931.03it/s]

Perplexities: ppx1=318.213 ppx3=1.520 ppx10=1.184





Now let's measure the actual perplexity: we'll split the data into train and test and score model on test data only.

In [None]:
from sklearn.model_selection import train_test_split
train_lines, test_lines = train_test_split(lines, test_size=0.25, random_state=42)

for n in (1, 2, 3):
    lm = NGramLanguageModel(n=n, lines=train_lines)
    ppx = perplexity(lm, test_lines)
    print("N = %i, Perplexity = %.5f" % (n, ppx))


N-grams: 100%|██████████| 30750/30750 [00:06<00:00, 4816.75it/s]
100%|██████████| 1/1 [00:00<00:00, 21.03it/s]
  min_logprob, np.log(lm.get_next_token_prob(prefix, tokens[t]))
100%|██████████| 10250/10250 [00:18<00:00, 553.45it/s]


N = 1, Perplexity = 897.20992


N-grams: 100%|██████████| 30750/30750 [00:08<00:00, 3726.91it/s]
100%|██████████| 54176/54176 [00:00<00:00, 81148.08it/s]
100%|██████████| 10250/10250 [00:19<00:00, 526.30it/s]


N = 2, Perplexity = 357.77281


N-grams: 100%|██████████| 30750/30750 [00:12<00:00, 2464.20it/s]
100%|██████████| 1005464/1005464 [00:06<00:00, 165099.61it/s]
100%|██████████| 10250/10250 [00:22<00:00, 447.86it/s]

N = 3, Perplexity = 5890.55349





In [None]:
# whoops, it just blew up :)

### LM Smoothing

The problem with our simple language model is that whenever it encounters an n-gram it has never seen before, it assigns it a probability of 0. Every time this happens, perplexity explodes.

To battle this issue, there's a technique called __smoothing__. The core idea is to modify counts in a way that prevents probabilities from getting too low. The simplest algorithm here is additive smoothing (aka [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)):

$$ P(w_t | prefix) = { Count(prefix, w_t) + \delta \over \sum_{\hat w} (Count(prefix, \hat w) + \delta) } $$

If counts for a given prefix are low, additive smoothing will push the probabilities toward a more uniform distribution. Note that the summation in the denominator goes over _all words in the vocabulary_.
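
The formula above can be verified by hand on a toy case. The function name `additive_smoothing`, the four-word vocabulary and the counts are assumptions made up for this sketch.

```python
from collections import Counter


def additive_smoothing(token_counts, vocab, delta=1.0):
    """Return {token: smoothed probability} over the whole vocabulary."""
    # Denominator: observed counts plus delta for every vocabulary word.
    total = sum(token_counts.values()) + delta * len(vocab)
    return {token: (token_counts.get(token, 0) + delta) / total for token in vocab}


# Hypothetical toy vocabulary and counts for a single prefix.
vocab = ["learning", "networks", "models", "cats"]
token_counts = Counter({"learning": 3, "networks": 1})

smoothed = additive_smoothing(token_counts, vocab, delta=1.0)
print(smoothed)
```

Here the denominator is $4 + 1 \cdot 4 = 8$, so the seen words get $4/8$ and $2/8$, while each unseen word gets $1/8$ instead of zero, and the probabilities still sum to one.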

Here's an example code we've implemented for you:

In [None]:
class LaplaceLanguageModel(NGramLanguageModel): 
    """ this code is an example, no need to change anything """
    def __init__(self, lines, n, delta=1.0):
        self.n = n
        counts = count_ngrams(lines, self.n)
        self.vocab = set(token for token_counts in counts.values() for token in token_counts)
        self.probs = defaultdict(Counter)

        for prefix in counts:
            token_counts = counts[prefix]
            total_count = sum(token_counts.values()) + delta * len(self.vocab)
            self.probs[prefix] = {token: (token_counts[token] + delta) / total_count
                                          for token in token_counts}
    def get_possible_next_tokens(self, prefix):
        token_probs = super().get_possible_next_tokens(prefix)
        missing_prob_total = 1.0 - sum(token_probs.values())
        missing_prob = missing_prob_total / max(1, len(self.vocab) - len(token_probs))
        return {token: token_probs.get(token, missing_prob) for token in self.vocab}
    
    def get_next_token_prob(self, prefix, next_token):
        token_probs = super().get_possible_next_tokens(prefix)
        if next_token in token_probs:
            return token_probs[next_token]
        else:
            missing_prob_total = 1.0 - sum(token_probs.values())
            missing_prob_total = max(0, missing_prob_total) # prevent rounding errors
            return missing_prob_total / max(1, len(self.vocab) - len(token_probs))
        

**Disclaimer**: the implementation above assumes all words unknown within a given context to be equally likely, *as well as the words outside of vocabulary*. Therefore, its perplexity will be lower than it should be when encountering such words, and comparing it with a model that has fewer unknown words will not be fair. When implementing your own smoothing, you may handle this by adding a virtual `UNK` token of non-zero probability. Technically, this will result in a model where probabilities do not add up to $1$, but it is close enough for a practice exercise.

In [None]:
#test that it's a valid probability model
for n in (1, 2, 3):
    dummy_lm = LaplaceLanguageModel(dummy_lines, n=n)
    assert np.allclose(sum([dummy_lm.get_next_token_prob('a', w_i) for w_i in dummy_lm.vocab]), 1), "I told you not to break anything! :)"

N-grams: 100%|██████████| 100/100 [00:00<00:00, 13397.76it/s]
N-grams: 100%|██████████| 100/100 [00:00<00:00, 10160.37it/s]
N-grams: 100%|██████████| 100/100 [00:00<00:00, 13385.79it/s]


In [None]:
for n in (1, 2, 3):
    lm = LaplaceLanguageModel(train_lines, n=n, delta=0.1)
    ppx = perplexity(lm, test_lines)
    print("N = %i, Perplexity = %.5f" % (n, ppx))

N-grams: 100%|██████████| 30750/30750 [00:05<00:00, 5828.62it/s]
100%|██████████| 10250/10250 [00:25<00:00, 398.22it/s]


N = 1, Perplexity = 897.42411


N-grams: 100%|██████████| 30750/30750 [00:08<00:00, 3744.95it/s]
100%|██████████| 10250/10250 [00:23<00:00, 429.48it/s]


N = 2, Perplexity = 470.48021


N-grams: 100%|██████████| 30750/30750 [00:12<00:00, 2519.30it/s]
100%|██████████| 10250/10250 [00:23<00:00, 433.52it/s]

N = 3, Perplexity = 3679.44765





## Neural network

In [None]:
!pip install datasets transformers


To be able to push the model to the Hugging Face Hub, you need an access token:
https://huggingface.co/docs/hub/security-tokens


In [None]:
# !huggingface-cli login --token <TODO>

## Preparing the dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
%matplotlib inline

We will reuse the same ArXiv dataset as above: download it again, tokenize it the same way, and then wrap it with the 🤗 Datasets library.

In [None]:
# Alternative manual download link: https://yadi.sk/d/_nGyU2IajjR9-w
!wget "https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1" -O arxivData.json.tar.gz
!tar -xvzf arxivData.json.tar.gz


In [None]:
data = pd.read_json("./arxivData.json")
# assemble lines: concatenate title and description
lines = data.apply(lambda row: row['title'] + ' ; ' + row['summary'], axis=1).tolist()

sorted(lines, key=len)[:3]

# Task: convert lines (in-place) into strings of space-separated tokens. import & use WordPunctTokenizer
from nltk import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
lines = [
    ' '.join(
        tokenizer.tokenize(line.lower())
    ) for line in tqdm(lines)
]

  0%|          | 0/41000 [00:00<?, ?it/s]

In [None]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(lines, test_size=0.2)
lm_datasets = {'train' : train, 'valid' : valid}

In [None]:
from datasets import Dataset
my_dict = {"text": lines}
datasets = Dataset.from_dict(my_dict)
tr_test_datasets = datasets.train_test_split(test_size=0.1)

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

Calling `show_random_elements(tr_test_datasets["train"])` displays a few random samples. In our case, each text is a lowercased, tokenized paper title followed by ` ; ` and its abstract.

## Causal Language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left.

We will use the [`gpt2`](https://huggingface.co/gpt2) architecture for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead. For the tokenizer, you can replace the checkpoint by the one you trained yourself.

In [None]:
clm_model_checkpoint = "gpt2"
clm_tokenizer_checkpoint = "gpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. Here this is done with the `GPT2Tokenizer` class (the generic `AutoTokenizer` class can load the same vocabulary):

In [None]:
from transformers import GPT2Tokenizer, GPT2Model, AutoModelForCausalLM
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = AutoModelForCausalLM.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('zaaabik/gpt2-arxiv-clm')

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/907 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/510M [00:00<?, ?B/s]

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [None]:
tokenized_datasets = tr_test_datasets.map(tokenize_function, 
                                          batched=True, num_proc=4, 
                                          remove_columns=["text"])

Map (num_proc=4):   0%|          | 0/36900 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4100 [00:00<?, ? examples/s]

If we now look at an element of our datasets, we will see that the texts have been replaced by the `input_ids` the model will need.

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
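
The concatenate-then-chunk idea can be sketched on a toy batch in plain Python. The helper `chunk_token_ids` and the token ids below are made up for illustration; the real preprocessing operates on the dictionaries produced by the tokenizer.

```python
def chunk_token_ids(batch_of_ids, block_size):
    """Concatenate lists of token ids and cut them into equal-sized blocks."""
    flat = [token_id for ids in batch_of_ids for token_id in ids]
    usable = (len(flat) // block_size) * block_size  # drop the remainder
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Three hypothetical tokenized texts of different lengths (10 ids in total).
batch = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = chunk_token_ids(batch, block_size=4)

print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input texts become two output blocks, which is the kind of change in the number of examples that `batched=True` allows. The last two ids do not fill a block and are dropped.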

First, we choose the block size. The maximum length our model was pretrained with (`tokenizer.model_max_length`) might be a bit too big to fit in your GPU RAM, so here we take a smaller value of just 128.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library shift the labels internally, so we don't need to do it manually.
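
Conceptually, the shift pairs each position with the token that follows it. The snippet below is a plain-Python sketch of that pairing on a few token ids, not the library's actual implementation, which operates on tensors of logits and labels.

```python
# Hypothetical token ids of one short block; labels are an unshifted copy.
input_ids = [318, 2395, 9383, 14000, 837]
labels = list(input_ids)

# Position t is scored against the token at position t + 1, so the last
# position has no target and the first label is never predicted.
prediction_positions = input_ids[:-1]
targets = labels[1:]

for last_context_token, target in zip(prediction_positions, targets):
    print(f"after token {last_context_token} the model should predict {target}")
```

A block of `block_size` tokens therefore yields `block_size - 1` training targets.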

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, we drop the remainder once per batch of 1,000 examples, so that each batch's concatenated tokens are a multiple of `block_size`. You can adjust this behavior by passing a higher batch size (which will also be slower to process). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/36900 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4100 [00:00<?, ? examples/s]
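To see on a small scale what this grouping does, here is a toy re-implementation that can be run without the dataset. It is only a sketch: `group_texts_toy`, the tiny token ids and `block_size=4` are made up for illustration.

```python
def group_texts_toy(examples, block_size):
    """Concatenate tokenized texts, then cut them into equal-sized blocks."""
    # Flatten every column (here only input_ids) into one long list.
    concatenated = {
        k: [tok for seq in seqs for tok in seq] for k, seqs in examples.items()
    }
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}  # 3 texts, 10 tokens
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input texts become two output samples, the blocks cross text boundaries, and the last two tokens are dropped. This is the kind of change in the number of examples that `batched=True` permits.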

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [None]:
lm_datasets['train'][1]['labels']

[318,
 2395,
 9383,
 14000,
 837,
 435,
 89,
 16288,
 10040,
 837,
 416,
 543,
 262,
 3632,
 14091,
 29371,
 4263,
 357,
 285,
 380,
 1267,
 423,
 587,
 5625,
 355,
 281,
 5128,
 329,
 262,
 3781,
 764,
 356,
 5150,
 257,
 649,
 5361,
 4326,
 543,
 11583,
 3033,
 422,
 1811,
 19887,
 286,
 257,
 24637,
 4991,
 286,
 262,
 4220,
 7679,
 290,
 552,
 1769,
 257,
 39279,
 300,
 17,
 532,
 2593,
 329,
 1306,
 7679,
 764,
 356,
 16726,
 262,
 5150,
 3164,
 319,
 734,
 1180,
 269,
 20471,
 45619,
 290,
 1271,
 286,
 2968,
 18335,
 40522,
 764,
 262,
 11992,
 2482,
 10176,
 262,
 9098,
 2694,
 286,
 262,
 5150,
 3164,
 764,
 6381,
 298,
 27499,
 6268,
 837,
 36282,
 290,
 4469,
 3033,
 287,
 5861,
 2162,
 428,
 3348,
 20718,
 281,
 555,
 16668,
 16149,
 9355,
 284,
 7925,
 5026,
 31589,
 5527,
 3033,
 329,
 2008,
 10552,
 764,
 7867,
 416,
 703,
 262]

In [None]:
lm_datasets['train'][1]['input_ids'][:10]

[318, 2395, 9383, 14000, 837, 435, 89, 16288, 10040, 837]

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
training_args = TrainingArguments(
    "gpt2-arxiv-clm",  # output directory (also the default Hub repo name)
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    # push_to_hub=True
)

The last (commented-out) argument, `push_to_hub=True`, sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Leave it commented out if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name different from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).

We pass along all of those to the `Trainer` class:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['test'],
)

And we can train our model:

In [None]:
# trainer.train()

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 6561
  Batch size = 8


Perplexity: 36.17


The perplexity is still quite high because, for this demo, we trained on a small dataset for a small number of epochs. For real LM training, you would need a larger dataset and more epochs.
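For intuition about what this number means, the sketch below computes perplexity directly from per-token probabilities as the exponential of the mean negative log-likelihood. The helper name `perplexity_from_token_probs` and the probabilities are invented for illustration; only the value 36.17 comes from the run above.

```python
import math


def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)  # plays the role of `eval_loss`
    return math.exp(mean_loss)


# A model that gives every correct token probability 0.25 is as uncertain
# as a uniform choice among 4 tokens.
uniform = perplexity_from_token_probs([0.25] * 8)
confident = perplexity_from_token_probs([0.9] * 8)
print(f"{uniform:.2f}")    # approximately 4.00
print(f"{confident:.2f}")  # approximately 1.11

# Going the other way: the mean loss implied by the perplexity reported above.
implied_loss = math.log(36.17)
print(f"{implied_loss:.3f}")
```

Read this way, a perplexity of about 36 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 36 tokens at each step.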

You can now upload the result of the training to the Hub by calling `trainer.push_to_hub()`.

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

In [None]:
# import math
# eval_results = trainer.evaluate()
# print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

In [None]:
# tokenizer.push_to_hub(
#     'gpt2-arxiv-clm'
# )

In [None]:
from transformers import pipeline
generator = pipeline(
    'text-generation', 
    model = 'zaaabik/gpt2-arxiv-clm',
    tokenizer = tokenizer
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--zaaabik--gpt2-arxiv-clm/snapshots/ef1a6836cf102932080b6a5d85078536391094fb/config.json
Model config GPT2Config {
  "_name_or_path": "zaaabik/gpt2-arxiv-clm",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length

In [None]:
generator('This paper')

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This paper considers four different problems, one of which is the problem of minimizing $\\ sparsity for any given class of samples. first, two problems are addressed : we consider the problem of reducing $ p_1, p_2 \\ leq ('}]

In [None]:
from transformers import pipeline
generator_not_trained = pipeline(
    'text-generation', 
    model = 'gpt2'
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_vers

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/generation_config.json
Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions":

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_ran

In [None]:
generator_not_trained('This paper')

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This paper will provide an overview of the basic and experimental approaches to developing a network of distributed databases. The aim of the paper is to draw attention to the importance of the data in constructing a highly scalable, open-source database system: a protocol that'}]

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset, with one additional step: we will randomly mask some tokens (by replacing them with `[MASK]`), and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!

We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead. If you trained your own tokenizer, replace the tokenizer checkpoint with it.

In [None]:
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "bert-base-cased"

We can apply the same tokenization function as before; we just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
from transformers import BertTokenizer, BertModel, BertForMaskedLM

In [None]:
tokenizer = BertTokenizer.from_pretrained(tokenizer_checkpoint)
tokenized_datasets = tr_test_datasets.map(tokenize_function, 
                                  batched=True, num_proc=4, 
                                  remove_columns=["text"])

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/tokenizer_config.json


Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



Map (num_proc=4):   0%|          | 0/36900 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (693 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (558 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (529 > 512). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/4100 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (581 > 512). Running this sequence through the model will result in indexing errors


And like before, we group texts together and chunk them into samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
mlm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/36900 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4100 [00:00<?, ? examples/s]

The rest is very similar to what we had, with two exceptions. First, we use a model suitable for masked LM, here `BertForMaskedLM`, which is loaded a few cells further down.

We redefine our `TrainingArguments`:

In [None]:
training_args = TrainingArguments(
    "bert-base-cased-arxiv-mlm",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    # push_to_hub=True # add this line to push your model during training 
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Like before, the last (commented-out) argument, `push_to_hub=True`, sets everything up so we can push the model to the [Hub](https://huggingface.co/models). Leave it commented out if you didn't follow the installation steps at the top of the notebook; otherwise you can also set `hub_model_id` to a repository name you prefer.

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to do the random masking. We could do it as a preprocessing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
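The idea behind this dynamic masking can be sketched in plain Python. This is a simplification under stated assumptions, not the collator's implementation: `mask_tokens_sketch` is a hypothetical helper, `MASK_ID = 103` is a placeholder id standing in for `[MASK]`, and the sketch always substitutes the mask id, whereas the real collator works on tensors, skips special tokens, and sometimes keeps the selected token or swaps in a random one.

```python
import random

MASK_ID = 103        # placeholder id standing in for [MASK] (assumption)
IGNORE_INDEX = -100  # label value that the loss skips


def mask_tokens_sketch(input_ids, mlm_probability=0.15, rng=random):
    """Mask each token independently with probability `mlm_probability`."""
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)  # hide the token from the model
            labels.append(tok)      # and ask the model to recover it
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)  # not scored by the loss
    return masked, labels


tokens = list(range(1000, 1020))  # made-up token ids
rng = random.Random(0)
epoch_1 = mask_tokens_sketch(tokens, rng=rng)
epoch_2 = mask_tokens_sketch(tokens, rng=rng)
print(epoch_1[0])
print(epoch_2[0])
```

Because the function is called again on every pass over the data, the same sample generally gets a different mask pattern each epoch, which is the benefit of masking in the collator rather than in a one-off preprocessing step.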

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
from transformers import Trainer

In [None]:
model = BertForMaskedLM.from_pretrained('zaaabik/bert-base-cased-arxiv-mlm')

Downloading (…)lve/main/config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--zaaabik--bert-base-cased-arxiv-mlm/snapshots/2321ff2750cc6dff453c8a11fddaffe910ac12ac/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



Downloading pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--zaaabik--bert-base-cased-arxiv-mlm/snapshots/2321ff2750cc6dff453c8a11fddaffe910ac12ac/pytorch_model.bin
Generate config GenerationConfig {
  "pad_token_id": 0,
  "transformers_version": "4.26.1"
}

All model checkpoint weights were used when initializing BertForMaskedLM.

All the weights of BertForMaskedLM were initialized from the model checkpoint at zaaabik/bert-base-cased-arxiv-mlm.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
Generation config file not found, using a generation config created from the model config.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=mlm_datasets['train'],
    eval_dataset=mlm_datasets['test'],
    data_collator=data_collator,
)

In [None]:
# trainer.train()

Like before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.
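A small sketch of what "only the masked tokens" means for the reported number: positions labelled `-100` are left out of the average before exponentiating. The helper `masked_perplexity` and all per-token losses below are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label used for positions that are not scored


def masked_perplexity(token_nll, labels):
    """Average the loss over labelled (masked) positions only, then exponentiate."""
    kept = [nll for nll, lab in zip(token_nll, labels) if lab != IGNORE_INDEX]
    return math.exp(sum(kept) / len(kept))


token_nll = [5.0, 2.0, 5.0, 2.1, 5.0, 5.0]     # made-up per-token losses
labels = [-100, 1207, -100, 3000, -100, -100]  # only two positions are scored
ppl = masked_perplexity(token_nll, labels)
print(f"{ppl:.2f}")  # approximately 7.77
```

The large made-up losses at the unlabelled positions have no effect on the result, so the CLM and MLM perplexities measure different tasks and should not be compared as if they were on the same scale.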

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 7036
  Batch size = 8


Perplexity: 7.67


The perplexity is still quite high because, for this demo, we trained on a small dataset for a small number of epochs. For real LM training, you would need a larger dataset and more epochs.

You can now upload the result of the training to the Hub by calling `trainer.push_to_hub()`. Below, we load the model that was already pushed and compare it with the original checkpoint:

In [None]:
from transformers import pipeline
our_mlm_model = pipeline(
    'fill-mask', 
    model = 'zaaabik/bert-base-cased-arxiv-mlm',
    tokenizer= BertTokenizer.from_pretrained('bert-base-cased')
)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embedd

In [None]:
our_mlm_model('The [MASK] model use convolution function between data and filters to get new data representation')

[{'score': 0.5451073050498962,
  'token': 3000,
  'token_str': 'proposed',
  'sequence': 'The proposed model use convolution function between data and filters to get new data representation'},
 {'score': 0.037218332290649414,
  'token': 1207,
  'token_str': 'new',
  'sequence': 'The new model use convolution function between data and filters to get new data representation'},
 {'score': 0.017436446622014046,
  'token': 3694,
  'token_str': 'resulting',
  'sequence': 'The resulting model use convolution function between data and filters to get new data representation'},
 {'score': 0.011302039958536625,
  'token': 3776,
  'token_str': 'learning',
  'sequence': 'The learning model use convolution function between data and filters to get new data representation'},
 {'score': 0.009850174188613892,
  'token': 2013,
  'token_str': 'training',
  'sequence': 'The training model use convolution function between data and filters to get new data representation'}]

In [None]:
lml_pipeline = pipeline('fill-mask', 
                        'bert-base-cased')

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Mod

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/pytorch_model.bin
Generate config GenerationConfig {
  "pad_token_id": 0,
  "transformers_version": "4.26.1"
}

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of BertForMaskedLM were initialized from the model checkpoint at bert-base-cased.
If

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu"

In [None]:
lml_pipeline('The [MASK] model use convolution function between data and filters to get new data representation')

[{'score': 0.03559460863471031,
  'token': 12123,
  'token_str': 'filter',
  'sequence': 'The filter model use convolution function between data and filters to get new data representation'},
 {'score': 0.022837158292531967,
  'token': 1378,
  'token_str': 'following',
  'sequence': 'The following model use convolution function between data and filters to get new data representation'},
 {'score': 0.019685106351971626,
  'token': 2985,
  'token_str': 'latter',
  'sequence': 'The latter model use convolution function between data and filters to get new data representation'},
 {'score': 0.017502127215266228,
  'token': 1207,
  'token_str': 'new',
  'sequence': 'The new model use convolution function between data and filters to get new data representation'},
 {'score': 0.017112644389271736,
  'token': 1269,
  'token_str': 'same',
  'sequence': 'The same model use convolution function between data and filters to get new data representation'}]