<p style="align: center;"><img align=center src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" style="height:450px;" width=500/></p>

<h3 style="text-align: center;"><b>Deep Learning School, FPMI MIPT</b></h3>
<h3 style="text-align: center;"><b>Advanced track (part 2). Spring 2021</b></h3>

<h1 style="text-align: center;"><b>Language modeling.</b></h1>

## Installation and Dataset

First, let's download a dataset of Python code samples. The dataset is provided by GitHub. [About the dataset](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/).

For preprocessing we will use the `datasets` library from Hugging Face, which we already know.

In [1]:
!pip install -q datasets

In [2]:
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
!unzip -p python.zip python/final/jsonl/train/python_train_0.jsonl.gz > train.jsonl.gz
!unzip -p python.zip python/final/jsonl/test/python_test_0.jsonl.gz > test.jsonl.gz

--2021-03-29 10:51:11--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.108.125
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.108.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 940909997 (897M) [application/zip]
Saving to: ‘python.zip.1’


2021-03-29 10:52:09 (15.6 MB/s) - ‘python.zip.1’ saved [940909997/940909997]



In [None]:
# decompress the gzip files (-f overwrites leftovers from a previous run)
!gzip -df train.jsonl.gz
!gzip -df test.jsonl.gz

Datasets can be loaded not only from the Hub but also from local disk. To do this, specify the file format and the path to the file.

In [None]:
from datasets import load_dataset  
dataset = load_dataset(
    "json",
    data_files=[
        "train.jsonl",
    ],
)

In [None]:
dataset

Limit the number of unique words to `40000`.

In [None]:
import tqdm
from collections import Counter


vocab_size = 40000
stats = Counter()

# stats is like word2freq dictionary
for item in tqdm.tqdm(dataset["train"]):
    stats.update(item["code_tokens"])

# select vocab_size most common words from stats
#   and extract the keys: they will be our vocabulary and, consequently, our tokens
tokens = dict(stats.most_common(vocab_size)).keys()

Have a look at the 20 most frequent tokens. They are what we would expect, because the data contains Python code.

In [None]:
stats.most_common(20)

1. Add *service* tokens "[PAD]", "[UNK]", "[EOS]" (end of sentence).

2. Make a `token2idx` dictionary (just enumerate them all).

3. Make `idx2token`, the inverse of `token2idx`.
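
The three steps above can be sketched on a toy token list. The corpus and all `toy_`-prefixed names are made up for illustration and are not part of the notebook's data; the vocabulary size of 4 is likewise an arbitrary assumption.

```python
from collections import Counter

# Toy corpus standing in for item["code_tokens"]; purely illustrative.
toy_corpus = [["def", "f", "(", ")", ":"], ["def", "g", "(", "x", ")", ":"]]

toy_stats = Counter()
for toy_tokens in toy_corpus:
    toy_stats.update(toy_tokens)

# 1. service tokens occupy the first three indexes
toy_token2idx = {"[PAD]": 0, "[UNK]": 1, "[EOS]": 2}
# 2. enumerate the most frequent tokens after the service ones
for idx, (token, _) in enumerate(toy_stats.most_common(4)):
    toy_token2idx[token] = idx + 3
# 3. inverse mapping
toy_idx2token = {idx: token for token, idx in toy_token2idx.items()}

# unknown tokens ("h" here) fall back to the [UNK] index, then [EOS] is appended
toy_encoded = [toy_token2idx.get(t, 1) for t in ["def", "h", "("]] + [2]
toy_decoded = [toy_idx2token[i] for i in toy_encoded]
print(toy_encoded, toy_decoded)
```

Decoding is lossy for out-of-vocabulary tokens: every unknown token comes back as "[UNK]".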

In [None]:
# service tokens
PAD = 0
UNK = 1
EOS = 2
token2idx = {"[PAD]": PAD, "[UNK]": UNK, "[EOS]": EOS}

# token2idx
for idx, token in enumerate(tokens):
    token2idx[token] = idx + 3

# idx2token: inverse mapping
idx2token = {idx: token for token, idx in token2idx.items()}

Let's make a function which encodes the tokens as indexes.

In [None]:
def encode(token):
    """
    returns
        for known tokens - their indexes
        for unknown tokens - index of the '[UNK]' token
    """
    return token2idx.get(token, UNK)

Encode tokens as indexes in the train dataset.

In [None]:
dataset = dataset.map(
    lambda item: {
        "features": [encode(token) for token in item["code_tokens"]] + [EOS]
    }
)

## N-gram

Let's start with the simplest model. It is based on a statistical method. In language modeling we want to maximize the probability that the model assigns to our text, that is:
 $$
\mathrm{P}(\mathrm{W})=\mathrm{P}\left(\mathrm{w}_{1}, \mathrm{w}_{2}, \mathrm{w}_{3}, \mathrm{w}_{4}, \mathrm{w}_{5} \ldots \mathrm{w}_{\mathrm{n}}\right)
$$


Recall that a joint probability can be rewritten with the chain rule:

$$
P\left(x_{1}, x_{2}, x_{3}, \ldots, x_{n}\right)=P\left(x_{1}\right) P\left(x_{2} \mid x_{1}\right) P\left(x_{3} \mid x_{1}, x_{2}\right) \ldots P\left(x_{n} \mid x_{1}, \ldots, x_{n-1}\right)
$$

Then:

$$
P\left(w_{1} w_{2} \ldots w_{n}\right)=\prod_{i} P\left(w_{i} \mid w_{1} w_{2} \ldots w_{i-1}\right)
$$
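
As a worked instance of this factorization, take the four-token sequence `def f ( )` (tokens chosen only for illustration):

$$
P(\texttt{def}, \texttt{f}, \texttt{(}, \texttt{)}) = P(\texttt{def}) \, P(\texttt{f} \mid \texttt{def}) \, P(\texttt{(} \mid \texttt{def}, \texttt{f}) \, P(\texttt{)} \mid \texttt{def}, \texttt{f}, \texttt{(})
$$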

However, the number of probabilities of the form $P\left(w_{i} \mid w_{1} w_{2} \ldots w_{i-1}\right)$ grows very quickly with the length of the context. Therefore we make an assumption called the **Markov approximation**. It is stated as follows:

$$
P\left(w_{1} w_{2} \ldots w_{n}\right) \approx \prod_{i} P\left(w_{i} \mid w_{i-k} \ldots w_{i-1}\right)
$$

That is, we assume that the current word depends only on the $k$ previous ones.

$$
P\left(w_{i} \mid w_{1} w_{2} \ldots w_{i-1}\right) \approx P\left(w_{i} \mid w_{i-k} \ldots w_{i-1}\right)
$$
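
A minimal sketch of this approximation with $k = 1$ (a bigram model), estimated by counting on a toy token stream. The stream and the helper `cond_prob` are assumptions made for illustration; the notebook's own implementation follows in the next cell.

```python
from collections import Counter, defaultdict

# Toy token stream, purely illustrative.
toy_tokens = ["def", "f", "(", ")", ":", "def", "g", "(", ")", ":"]
k = 1  # Markov order: condition on one previous token (a bigram model)

# counts[context][next_token] = number of times next_token followed context
counts = defaultdict(Counter)
for i in range(k, len(toy_tokens)):
    context = tuple(toy_tokens[i - k:i])
    counts[context][toy_tokens[i]] += 1


def cond_prob(token, context):
    """Maximum-likelihood estimate of P(token | context) from the counts."""
    seen = counts[tuple(context)]
    total = sum(seen.values())
    return seen[token] / total if total else 0.0


print(cond_prob("f", ["def"]))  # "def" is followed by "f" once and by "g" once
print(cond_prob(")", ["("]))    # "(" is always followed by ")"
```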


In [None]:
import numpy as np
from collections import Counter, defaultdict

from tqdm.notebook import tqdm


class NGramModel(object):
    """
    The structure of this n-gram model implementation is as follows:
    self.ngrams: a dictionary that maps each (n-1)-tuple of tokens
        (token_0, ..., token_(n-2)) to the frequencies of the token that follows it.
        A Counter is used to count the tokens
    self.tokenize_func: a text tokenization function used to obtain tokens
        (not used below, because the dataset is already tokenized)
    """
    def __init__(self, n=2):
        self.n = n
        self.tokenize_func = None
        # maps tuples of tokens to their freqs
        self.ngrams = defaultdict(Counter)
        
    def compute_ngrams(self, dataset):
        self.ngrams = defaultdict(Counter)
        for row in tqdm(dataset):
            ngram = [PAD] * self.n
            for token in row["features"]:
                # shift the window towards the end of the sentence and add new token
                ngram[:-1] = ngram[1:]
                ngram[-1] = token
                self.ngrams[tuple(ngram[:-1])].update([ngram[-1]])
            
    def get_log_probs(self, prefix, min_log_pr=-15):
        """
        returns log relative frequencies of the tokens seen after the prefix
        (an empty dict if the prefix never occurred in the training data)
        """
        # short prefix => need to pad at the beginning
        if len(prefix) < self.n - 1:
            prefix = [PAD] * (self.n - len(prefix) - 1) + prefix
        # long prefix => just take the relevant tail
        else:
            prefix = prefix[-self.n + 1:]

        # .get avoids inserting unseen prefixes into the defaultdict
        possible_ends = self.ngrams.get(tuple(prefix), Counter())
        if not possible_ends:
            return {}
        log_total = np.log(sum(possible_ends.values()))

        # log(a/b) = log(a) - log(b); RHS is much more stable
        return {e: np.log(c) - log_total for e, c in possible_ends.items()}
    
    def sample(self, prefix):
        possible_ends = self.get_log_probs(prefix)
        if len(possible_ends) > 0:
            end = np.random.choice(list(possible_ends.keys()), p=np.exp(list(possible_ends.values())))
            return end
        return EOS

### Training

Initialize the model.

In [None]:
n_gram_model = NGramModel(n=5)

Train the model.

In [None]:
n_gram_model.compute_ngrams(dataset["train"])

### Text generation

Generate some code with the n-gram model.

In [None]:
prefix = ["def", "train", "("]
encoded_prefix = [token2idx[token] for token in prefix]
length=100

for i in range(length):
    cur_token = n_gram_model.sample(encoded_prefix)
    if cur_token == EOS:
        break
    encoded_prefix += [cur_token]


decoded_text = [idx2token[idx] for idx in encoded_prefix]
print(" ".join(decoded_text))

### Testing

In [None]:
test_dataset = load_dataset(
    "json",
    data_files=[
        "test.jsonl",
    ],
)

In [None]:
max_seq_len=128

test_dataset = test_dataset.map(
    lambda item: {
        "features": [encode(token) for token in item["code_tokens"]][:max_seq_len-1] + [EOS]
    }
)

### Evaluation metric: Perplexity (PP)

See [Perplexity in Language Models](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94): 

*Intuitively, if a model assigns a high probability to the test set, it means that it is not surprised to see it (it’s **not perplexed** by it), which means that it has a good understanding of how the language works.*

$$
P P(p):=2^{H(p)}=2^{-\sum_{x} p(x) \log _{2} p(x)}
$$

From what we know of cross-entropy we can say that $H(W)$ is the **average number of bits needed to encode each word**. This means that the perplexity $2^{H(W)}$ is the **average number of words that can be encoded using $H(W)$ bits.**


**We can also use**

$$
P P' (p):=e^{H(p)}=e^{-\sum_{x} p(x) \ln p(x)}
$$

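And the version averaged over the $n$ tokens of the evaluated text, where the expectation over $p$ is replaced by the mean over the observed tokens:

$$
P P'' (W):=e^{-\frac{1}{n}\sum_{i=1}^{n} \ln P\left(w_{i} \mid w_{1} w_{2} \ldots w_{i-1}\right)}
$$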
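
In practice the averaged version is computed from the probabilities the model assigned to the tokens that actually occurred: take their logarithms, average, negate and exponentiate. A minimal sketch, where `toy_perplexity` and the probability lists are assumptions made for illustration:

```python
import math


def toy_perplexity(token_probs):
    """exp of the mean negative log-probability the model gave to each observed token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 8 tokens is "choosing among 8 options".
uniform = toy_perplexity([1 / 8] * 5)
# A more confident (and correct) model has lower perplexity.
confident = toy_perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

The uniform case shows the usual reading of perplexity as an effective branching factor: a model that is equally unsure among 8 tokens at every step has perplexity 8.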

In [None]:
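def count_perplexity(model, dataset, max_iter_num: int = 1000, unseen_log_prob: float = -10.0):
    """
    Computes the perplexity exp(mean negative log-likelihood per token) of an
    n-gram model over at most `max_iter_num` samples of `dataset`.
    A token that the model has never seen after the given prefix gets the fixed
    log-probability `unseen_log_prob` instead of -inf.
    """
    entropy = 0
    iter_num = 0
    num_words = 0
    for item in tqdm(dataset, total=min(max_iter_num, len(dataset))):
        if iter_num >= max_iter_num:
            break
        output_so_far = [item["features"][0]]

        for token in item["features"][1:]:
            num_words += 1
            try:
                log_probs = model.get_log_probs(output_so_far)
                entropy += -log_probs[token]
            except KeyError:
                # unseen continuation: fall back to the fixed penalty
                entropy += -unseen_log_prob
            output_so_far.append(token)
        iter_num += 1
    mean_entropy = entropy / num_words
    return np.e ** mean_entropy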

In [None]:
count_perplexity(n_gram_model, test_dataset["train"])

## CNN

![](https://lena-voita.github.io/resources/lectures/lang_models/neural/cnn/cnn_main-min.png)



In [None]:
# transform lists into torch tensors (still of varying lengths)
dataset.set_format(type="torch", columns=["features"])
test_dataset.set_format(type="torch", columns=["features"])

In [None]:
import torch


def collate_fn(batch):
    """
    1. Takes a batch as an argument. The DataLoader below keeps the default
        batch_size=1 while TextSampler yields a whole list of indexes per step,
        so `batch` arrives as a one-element list holding a dict of columns
    2. Extracts a list of tensors (features) of varying lengths as batch['features']
    3. Pads them with zeros (the PAD index) so all the tensors in a batch have the
        same length and puts them all in one tensor `input_embeds`

    returns: a dictionary {"features": input_embeds}
    """
    batch = batch[0]
    max_len = max(len(f_t) for f_t in batch["features"])
    input_embeds = torch.zeros((len(batch["features"]), max_len), dtype=torch.long)
    for idx, row in enumerate(batch["features"]):
        input_embeds[idx][:len(row)] += row
    return {
        "features": input_embeds,
    }

The sampler below works like a batch sampler, but with a varying batch size.

**Q:** Why do we do this?

**A:** Our collate function pads every sentence in a batch to the length of the longest one, so the number of tokens in a padded batch varies: sometimes it is small, sometimes large. The batch matrix holds **T** = `batch_size` $\times$ `len(longest sentence in the batch)` tokens. If we fix the batch size and encounter a very long sentence in the middle of training, we might run out of memory and the training will crash. To avoid that, we track **T** and stop adding sentences to the batch when one more sentence would exceed a certain threshold. Of course, the amount of memory taken also depends linearly on the size of our `token embedding`.
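
The idea can be sketched on plain sentence lengths, without any PyTorch machinery. The helper `batch_by_token_budget` and the numbers are assumptions made for illustration; the notebook's actual sampler is defined in the next cell.

```python
def batch_by_token_budget(lengths, budget):
    """Group sentence indexes so that len(batch) * longest_sentence <= budget."""
    batches, batch, max_len = [], [], 0
    for idx, length in enumerate(lengths):
        max_len = max(max_len, length)
        # would the padded batch exceed the budget with this sentence added?
        if batch and (len(batch) + 1) * max_len > budget:
            batches.append(batch)
            batch, max_len = [], length
        batch.append(idx)
    if batch:
        batches.append(batch)
    return batches


toy_lengths = [3, 4, 2, 10, 1, 1]
toy_batches = batch_by_token_budget(toy_lengths, budget=12)
print(toy_batches)
```

A sentence longer than the whole budget still gets a batch of its own here, so nothing is dropped; whether that is acceptable depends on the available memory.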


In [None]:
from torch.utils.data import Sampler



class TextSampler(Sampler):
    def __init__(self, sampler, batch_size_tokens=1e4):
        self.sampler = sampler
        self.batch_size_tokens = batch_size_tokens

    def __iter__(self):
        batch = []
        max_len = 0
        for ix in self.sampler:
            row = self.sampler.data_source[ix]
            max_len = max(max_len, len(row["features"]))
            # if adding this sentence would exceed the token threshold, yield the
            #   batch collected so far; the sentence under consideration is not
            #   added to it (an empty batch is never yielded)
            if batch and (len(batch) + 1) * max_len > self.batch_size_tokens:
                yield batch
                # after yielding the batch
                # this sentence will be recorded as the first one in the fresh batch
                batch = []
                max_len = len(row["features"])
            # in both cases (current batch is old batch or current batch is a fresh batch)
            #   we will append the sentence index to the current batch
            batch.append(ix)

        # if we ran out of sentences and have not yielded the last non-empty batch,
        #   it's time to do so now
        if len(batch) > 0:
            yield batch

    def __len__(self):
        # number of sentences, i.e. only an upper bound on the number of batches
        return len(self.sampler)

In [None]:
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler, random_split

# sample indexes at random every time for better training
train_sampler = RandomSampler(dataset["train"])
# a plain sequential sampler is enough when we do not train (validation/testing)
valid_sampler = SequentialSampler(test_dataset["train"])



loaders = {
    "train": DataLoader(
        dataset["train"],                           # dataset to use
        collate_fn=collate_fn,                      # convert list of inputs into batches
        sampler=TextSampler(sampler=train_sampler,) # Sample batches of varying batch size
    ),
    "valid": DataLoader(
        test_dataset["train"],                      # dataset to use
        collate_fn=collate_fn,                      # convert list of inputs into batches
        sampler=TextSampler(sampler=valid_sampler,) # Sample batches of varying batch size
    )
}

In [None]:
import torch
import torch.nn as nn


class CNNLM(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, num_layers=3, kernel_size: int = 5):
        super().__init__()
        
        self.emb = nn.Embedding(vocab_size, emb_size)
        layers = []

        # PADDING
        # YOUR CODE GOES HERE (DOWN)
        for layer_idx in range(num_layers):
            # pad only on the left so that position t never sees tokens after t
            layers.append(nn.ConstantPad1d((kernel_size - 1, 0), 0.0))
            in_channels = emb_size if layer_idx == 0 else hidden_size
            layers.append(nn.Conv1d(in_channels, hidden_size, kernel_size=kernel_size))
        # YOUR CODE GOES HERE (UP)


        self.conv_layers = nn.Sequential(*layers)
        # for receptive_field, check the picture below (red part shows receptive field)
        self.receptive_field = kernel_size + (kernel_size-1)*(num_layers-1)
        self.pred = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, input_ids):
        #print(input_ids.shape)                 # (batch_size, max_len) max_len - maximum sentence length in the batch w/o padding
        embed = self.emb(input_ids)             # (batch_size, max_len, emb_size)
        #print(embed.shape)
        embed = embed.permute(0, 2, 1)          # (batch_size, emb_size, max_len) want to convolve over embeddings for words
        #print(embed.shape)
        features = self.conv_layers(embed)      # (batch_size, hidden_size, max_len)
        #print(features.shape)
        features = features.permute(0, 2, 1)    # (batch_size, max_len, hidden_size)
        #print(features.shape)
        logits = self.pred(features)            # (batch_size, max_len, vocab_size)
        #print(logits.shape)
        return logits

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = CNNLM(len(tokens) + 3, 300, 100, num_layers=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

In [None]:
from tqdm.notebook import tqdm, trange


def train(
    num_epochs: int, 
    model: nn.Module,
    train_loader: DataLoader,
    valid_loader: DataLoader,
    criterion: nn.Module,
    optimizer: torch.optim.Optimizer,
    max_grad_norm: float = None
):
    for epoch in trange(num_epochs):
        pbar = tqdm(train_loader, leave=False, total=len(train_loader)//20)
        pbar.set_description("Train epoch")
        model.train()
        for batch in pbar:
            optimizer.zero_grad()
            features = batch["features"].to(device)

            # inputs are all tokens but the last one; targets are the same tokens shifted by one
            predictions = model(features[:, :-1])
            loss = criterion(
                predictions.reshape(-1, predictions.size(-1)), # (batch_size * (max_len - 1), vocab_size)
                features[:, 1:].reshape(-1)                    # (batch_size * (max_len - 1), )
            )
            loss.backward()

            # gradient clipping
            if max_grad_norm is not None:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
        model.eval()
        mean_loss = 0
        pbar = tqdm(valid_loader, leave=False, total=len(valid_loader)//100)
        pbar.set_description("Valid epoch")
        num_iter=0
        for batch in pbar:
            features = batch["features"].to(device)
            with torch.no_grad():
                predictions = model(features[:, :-1])
                loss = criterion(
                    predictions.reshape(-1, predictions.size(-1)),
                    features[:, 1:].reshape(-1)
                )
            mean_loss += loss.item()
            num_iter += 1
        mean_loss /= num_iter
        print(f"Epoch: {epoch}; mean loss: {mean_loss}; perplexity: {np.exp(mean_loss)}")
            

In [None]:
train(
    num_epochs=1,
    model=model, 
    train_loader=loaders["train"],
    valid_loader=loaders["valid"],
    criterion=criterion,
    optimizer=optimizer,
)

![](https://lena-voita.github.io/resources/lectures/lang_models/neural/cnn/receptive_field-min.png)


How do we increase the receptive field?

Add more layers.
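
How the receptive field grows with depth can be sketched as follows, assuming stride 1 and no dilation, as in the `CNNLM` class above. `stacked_receptive_field` is a hypothetical helper, not part of the notebook.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Number of input positions that can influence one output of stacked causal convolutions."""
    field = 1
    for _ in range(num_layers):
        # every layer lets an output look (kernel_size - 1) positions further back
        field += kernel_size - 1
    return field


for num_layers in (1, 3, 6):
    print(num_layers, stacked_receptive_field(kernel_size=5, num_layers=num_layers))
```

The growth is only linear in the number of layers, which is why a long context needs a deep stack.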

How do we train such a deep network?

Add residual connections.


![](https://lena-voita.github.io/resources/lectures/lang_models/neural/cnn/cnn_with_residual-min.png)
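
A residual connection adds the input of a block back to its output, so each block only has to learn a correction and the signal has a direct path through the stack. Below is a minimal pure-Python sketch with a single scalar channel; the helpers and the kernel values are assumptions made for illustration and this is not the notebook's `CNNLM`.

```python
def causal_conv(xs, weights):
    """1-D causal convolution on a list of floats: output t only sees inputs up to t."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(xs)  # left padding keeps the length and hides the future
    return [sum(w * padded[t + j] for j, w in enumerate(weights)) for t in range(len(xs))]


def residual_block(xs, weights):
    """Adds the block input back to the convolution output (skip connection)."""
    return [x + y for x, y in zip(xs, causal_conv(xs, weights))]


toy_signal = [1.0, 2.0, 3.0, 4.0]
toy_weights = [0.5, 0.25, 0.25]  # hypothetical kernel of size 3

out = toy_signal
for _ in range(3):  # three stacked residual blocks
    out = residual_block(out, toy_weights)
print(out)
```

With an all-zero kernel the block reduces to the identity, which is the property that makes deep stacks easier to optimize.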



### Visualizing Softmax outputs and using Temperature

**Q:** What's the deal with temperature?

**A:** The logits are divided by the temperature $T$ before the softmax, so $T$ appears in every exponent, both in the numerator and in the denominator. A low temperature increases the differences between the largest and the smallest values and makes the distribution sharper; a high temperature pushes the softmax output closer to the uniform distribution.
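
A minimal numeric sketch of this effect, using only the standard library. `softmax_with_temperature` and the logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T); subtracting the max keeps exp() from overflowing."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
for T in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(toy_logits, T)
    print(T, [round(p, 3) for p in probs])
```

At `T = 0.1` almost all of the mass sits on the largest logit, while at `T = 10` the three probabilities are close to one another.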

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from ipywidgets import interactive
from IPython import display

sns.set(style="whitegrid", font_scale=1.4)

sample = np.random.randn(10)
def plot_temperature(T: float = 1.0):
    plt.figure(figsize=(12, 8))
    plt.title(f"Temperature = {T}")
    probs = np.exp(sample / T) / np.sum(np.exp(sample / T)) # the only new bit (scaling softmax)
    plt.bar(range(10), probs)
    plt.xlabel("tokens")
    plt.ylabel("probs")
    plt.show()


v = interactive(
    plot_temperature, T=(0.02, 10)
)

In [None]:
display.display(v)

### Generate text

Check how we can extract the receptive field from a model. Some models won't have it.

We want to know the receptive field so that we can give the model fewer tokens as input.

In [None]:
try:
    print(model.receptive_field)
except AttributeError as e:
    print(e)

Make a function to generate text.

In [None]:
from typing import List
from torch.distributions import Categorical

@torch.no_grad()
def generate(
    prefix, model, length: int = 100, receptive_field: int = 5, T: float = 1.
) -> torch.Tensor:
    prefix = torch.from_numpy(prefix)
    prefix = prefix.unsqueeze(0).to(device)
    model.eval()
    # use the model's own receptive field if it has one (e.g. CNNLM),
    #   otherwise fall back to the `receptive_field` argument (e.g. LSTM)
    window = getattr(model, "receptive_field", receptive_field)
    for iter_idx in range(length):
        # feed only the last `window` tokens;
        #   a CNN does not use the earlier ones anyway
        preds = model(prefix[:, -window:])
        # print(preds.shape)                # (batch_size, max_len, vocab_size)


        # only interested in the last token before giving it into softmax
        # print(preds[:, -1, :].shape)      # (batch_size, vocab_size)

        # scale by Temperature before applying softmax on the last dimension (vocab_size)
        probs = torch.softmax(preds[:, -1, :]/T, dim=-1)
        # print(probs.shape)                # (batch_size, vocab_size)

        # to sample from discrete distribution with known probs
        #   use torch.distributions.Categorical with method .sample()
        distribution = Categorical(probs)
        sampled = distribution.sample()

        # if we reached the end of the sentence token - break
        if sampled.item() == EOS:
            break

        # record the last-generated token
        prefix = torch.cat((prefix, sampled.unsqueeze(0)), dim=1)
    return prefix

Generate text for 10 different temperatures. Temperature scales the logits inside the **softmax** function, so it makes sense to try temperatures on a **log scale** (**softmax exponentiates the scaled logits** anyway).

In [None]:
prefix = ["def", "train", "("]
encoded_prefix = np.array([token2idx[t] for t in prefix])

for t in np.logspace(0.002, 1, 10):
    generated = generate(
        encoded_prefix, 
        model, 
        receptive_field=model.receptive_field, 
        length=20,
        T=t-1
    )
    print(f"Temperature: {t-1}")
    print(" ".join([idx2token[idx] for idx in generated.cpu().numpy().flatten()]))

## LSTM

![](https://lena-voita.github.io/resources/lectures/lang_models/neural/rnn/rnn_simple-min.png)

In [None]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.pred = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, input_ids):
        # print(input_ids.shape)    # (batch_size, max_len) max_len - maximum sentence length in the batch w/o padding
        embs = self.emb(input_ids)
        # print(embs.shape)         # (batch_size, max_len, emb_size)
        output, _ = self.lstm(embs)
        # print(output.shape)       # (batch_size, max_len, hidden_size)
        output = self.pred(output)
        #print(output.shape)         # (batch_size, max_len, vocab_size)
        # a = 1/0 # stops training after printing everything once
        return output

In [None]:
model = LSTM(len(token2idx), 300, 50).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

In [None]:
train(
    num_epochs=1,
    model=model,
    train_loader=loaders["train"],
    valid_loader=loaders["valid"],
    criterion=criterion,
    optimizer=optimizer,
)

## Text generation methods

### Greedy Search

$$
w_t = \operatorname{argmax}_{w} P\left(w \mid w_{1: t-1}\right)
$$

![](https://huggingface.co/blog/assets/02_how-to-generate/greedy_search.png)

**Problem**: the model quickly starts repeating the same phrase.
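
The repetition is easy to reproduce on a toy model. The next-token table `toy_lm` below is hypothetical (not learned from data) and `greedy_decode` is a sketch of the argmax rule above.

```python
# Hypothetical next-token distributions P(next | current); not learned from data.
toy_lm = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.7, "c": 0.3},
    "c": {"a": 0.5, "b": 0.5},
}


def greedy_decode(start, steps):
    """Always pick the most probable next token."""
    out = [start]
    for _ in range(steps):
        next_probs = toy_lm[out[-1]]
        out.append(max(next_probs, key=next_probs.get))
    return out


print(" ".join(greedy_decode("a", 7)))
```

Because the choice is deterministic, the decoder falls into the `a b a b ...` cycle and can never leave it, even though `c` has substantial probability at every step.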

### Beam search

![](https://huggingface.co/blog/assets/02_how-to-generate/beam_search.png)

**Problem**: the model still produces text that is too predictable, unlike human speech.
![](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)
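
A minimal sketch of beam search on a hypothetical first-order model (`toy_lm` is made up for illustration; real implementations also handle end-of-sequence tokens and length normalization, which are omitted here). With `num_beams=1` the procedure reduces to greedy search.

```python
import math

# Hypothetical model: P(next | previous token); "<s>" starts a sequence.
toy_lm = {
    "<s>": {"a": 0.5, "b": 0.4, "c": 0.1},
    "a": {"a": 0.4, "b": 0.3, "c": 0.3},
    "b": {"a": 0.1, "b": 0.1, "c": 0.8},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def beam_search(num_beams, steps):
    """Keep the `num_beams` highest-scoring prefixes at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for log_p, tokens in beams:
            for token, p in toy_lm[tokens[-1]].items():
                candidates.append((log_p + math.log(p), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy = beam_search(num_beams=1, steps=2)[0]
beam = beam_search(num_beams=2, steps=2)[0]
print(greedy[1], math.exp(greedy[0]))
print(beam[1], math.exp(beam[0]))
```

Here the greedy path commits to `a` at the first step, while two beams keep `b` alive long enough to find the more probable continuation `b c`.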

### Sampling

$$
w_{t} \sim P\left(w \mid w_{1: t-1}\right)
$$

![](https://huggingface.co/blog/assets/02_how-to-generate/sampling_search_with_temp.png)

**Problem**: the coherence of the text suffers. Some phrases turn out too random.

### Top-K Sampling


![](https://huggingface.co/blog/assets/02_how-to-generate/top_k_sampling.png)

You can also use top-p sampling: greedily collect the most probable words until their total probability reaches p. Alternatively, take only the top-10/top-50 words (top-k).
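
Both filters can be sketched on a single next-token distribution. The helpers `top_k_filter` and `top_p_filter` and the distribution `toy_probs` are assumptions made for illustration; after filtering, the next token is sampled from the renormalized distribution.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical next-token distribution.
toy_probs = {"(": 0.5, "self": 0.25, "x": 0.125, "args": 0.0625, "foo": 0.0625}
print(top_k_filter(toy_probs, k=2))
print(top_p_filter(toy_probs, p=0.8))
```

Top-k always keeps a fixed number of candidates, whereas top-p keeps few candidates when the model is confident and many when the distribution is flat.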


In [None]:
prefix = ["def", "train", "("]
encoded_prefix = np.array([token2idx[t] for t in prefix])

generated = generate(encoded_prefix, model)

In [None]:
prefix = ["def", "train", "("]
encoded_prefix = np.array([token2idx[t] for t in prefix])


for t in np.logspace(0.002, 1, 10):
    generated = generate(
        encoded_prefix, 
        model, 
        receptive_field=20, 
        length=20,
        T=t-1
    )
    print(f"Temperature: {t-1}")
    print(" ".join([idx2token[idx] for idx in generated.cpu().numpy().flatten()]))

## References



1.   [Notes from the YSDA (Yandex School of Data Analysis) course.](https://lena-voita.github.io/nlp_course/language_modeling.html)
2.   [Blog post on text generation from Hugging Face.](https://huggingface.co/blog/how-to-generate) Don't worry for now about which model is used in the example there. We will cover it in detail in one of the upcoming lessons.

