## Chit-chat Transformers Tutorial
### BERT for text generation
In this tutorial we will learn:
* how to use Masked Language Modeling to sample words from a pre-trained BERT
* how to build a text generator from a pre-trained BERT

#### 0. Install requirements and Download pre-trained BERT model

Make sure that you are using a GPU. A GPU is not required, but it speeds up the computations considerably. In Colab you can choose an environment with a GPU.

Uncomment the next cell if DeepPavlov is not installed.

In [None]:
# ! pip install deeppavlov

Install the TensorFlow implementation of the BERT model

In [None]:
! pip install git+https://github.com/deepmipt/bert.git@feat/multi_gpu

Download BERT-base model

In [None]:
! wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

and unpack it.

In [None]:
! unzip uncased_L-12_H-768_A-12.zip

In [None]:
BERT_MODEL_PATH = './uncased_L-12_H-768_A-12/'

#### 1. Import BERT preprocessing from DeepPavlov

In [None]:
import deeppavlov
from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor

In [None]:
# set up max sequence length in subtokens
max_seq_len = 20

In [None]:
# initialize bert preprocessor
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=max_seq_len)

`BertPreprocessor` takes two texts as input and outputs the following features for the BERT model:
* tokens - list of subtokens with special BERT tokens: `[CLS]`, `[SEP]`
* input_ids - list of subtokens converted to indices
* input_mask - to distinguish PADded tokens from real ones. 0 - for paddings.
* input_type_ids - as we want to feed two texts, we should distinguish them. 0 - for `text_a`, 1 - for `text_b`.
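
To make the four features concrete, here is a minimal sketch that builds them by hand for a pair of already tokenized texts. It is not DeepPavlov's implementation: `build_features` and `toy_vocab` are hypothetical names used only for illustration, and the sketch assumes the pair fits into `max_seq_length`.

```python
def build_features(tokens_a, tokens_b, vocab, max_seq_length):
    """Toy version of BERT input features for two already tokenized texts."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # segment 0 covers [CLS], text_a and the first [SEP];
    # segment 1 covers text_b and the final [SEP]
    input_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(tokens)
    # pad everything except `tokens` with zeros (assumes the pair fits)
    pad = max_seq_length - len(tokens)
    input_ids += [0] * pad
    input_mask += [0] * pad
    input_type_ids += [0] * pad
    return tokens, input_ids, input_mask, input_type_ids


# hypothetical toy vocabulary, not the real BERT vocab.txt
toy_vocab = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2,
             'bob': 3, 'is': 4, 'nice': 5, 'he': 6, 'smiles': 7}

tokens, input_ids, input_mask, input_type_ids = build_features(
    ['bob', 'is', 'nice'], ['he', 'smiles'], toy_vocab, max_seq_length=10)

print('tokens:', tokens)
print('input_ids:', input_ids)
print('input_mask:', input_mask)
print('input_type_ids:', input_type_ids)
```

Note that padded positions get 0 in all three numeric features, so `input_type_ids` alone cannot tell padding from `text_a`; that is the job of `input_mask`.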

Let's inspect them for sample input.

In [None]:
input_example = bp(texts_a = ['Bob is a good man.'], texts_b = ['He has three kids.'])[0]
print('tokens:', input_example.tokens)
print('input_ids:', input_example.input_ids)
print('input_mask:', input_example.input_mask)
print('input_type_ids:', input_example.input_type_ids)

#### 2. Build BERT model

In [None]:
from bert_dp import modeling

In [None]:
bert_config = modeling.BertConfig.from_json_file(BERT_MODEL_PATH + 'bert_config.json')
print('BERT model parameters:')
bert_config.to_dict()

In [None]:
import tensorflow as tf

In [None]:
# we should define placeholders for BERT model
input_ids_ph = tf.placeholder(shape=(None, None), dtype=tf.int32)
input_masks_ph = tf.placeholder(shape=(None, None), dtype=tf.int32)
token_types_ph = tf.placeholder(shape=(None, None), dtype=tf.int32)
is_train_ph = tf.placeholder_with_default(False, shape=[])

In [None]:
# this will build Tensorflow graph for BERT model
bert_model = modeling.BertModel(config=bert_config,
                                is_training=is_train_ph,
                                input_ids=input_ids_ph,
                                input_mask=input_masks_ph,
                                token_type_ids=token_types_ph,
                                use_one_hot_embeddings=False)

`bert_model` supports several types of output for different tasks:
* `bert_model.get_pooled_output()` will return single vector for each input example in batch -- result of dense layer applied to the last Transformer layer output for [CLS] subtoken. This output can be used for text classification tasks.
* `bert_model.get_sequence_output()` will return tensor of shape [batch_size, seq_len, 768] -- output of the last layer for each subtoken. This output can be used for sequence tagging, question answering tasks.
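
The difference between the two outputs can be sketched with NumPy. This is only an illustration with random stand-in tensors and toy sizes; the `tanh` activation in the pooler is an assumption based on the reference BERT implementation, and none of the names below come from `bert_dp`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 2, 5, 8  # toy sizes; BERT-base uses hidden = 768

# stand-in for get_sequence_output(): one vector per subtoken
sequence_output = rng.normal(size=(batch_size, seq_len, hidden))

# stand-in for get_pooled_output(): a dense layer applied to the vector
# of the first subtoken ([CLS]); tanh is assumed as the activation
pooler_w = rng.normal(size=(hidden, hidden))
pooler_b = np.zeros(hidden)
pooled_output = np.tanh(sequence_output[:, 0, :] @ pooler_w + pooler_b)

print(sequence_output.shape)  # (2, 5, 8)
print(pooled_output.shape)    # (2, 8)
```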

In [None]:
# let's check the result of get_sequence_output
bert_model.get_sequence_output()

#### 3. Build BERT model for Masked Language Modeling task
The BERT model was trained on the Masked Language Modeling task, i.e. predicting a MASKED word from its context.


Let's define the `get_masked_lm_output` function, which returns probabilities for each word in the vocabulary for every MASKED word. This function gathers the result of `get_sequence_output()` at the masked positions and applies a dense layer followed by layer normalization. The resulting tensor of shape [batch_size * masked_tokens_n, 768] is multiplied by the transposed token embedding matrix (the matrix itself has shape [vocabulary_size, 768]) and an output bias is added. Finally, softmax gives a distribution over the vocabulary for each masked token, a tensor of shape [batch_size * masked_tokens_n, vocabulary_size].
```
softmax(dense(get_sequence_output()) * embeddings_matrix_T + bias)
```

We will take all required parameters (dense layer and biases) from pre-trained BERT model.
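
The formula above can be traced step by step in NumPy. This is a minimal sketch with random stand-in weights and toy sizes, not the TensorFlow code used in this notebook; `tanh` stands in for the real activation plus layer normalization. It also shows that gathering with flat offsets is equivalent to plain per-example indexing.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
batch_size, seq_len, hidden, vocab_size = 2, 6, 8, 11  # toy sizes

sequence_output = rng.normal(size=(batch_size, seq_len, hidden))
embeddings = rng.normal(size=(vocab_size, hidden))  # token embedding matrix
dense_w = rng.normal(size=(hidden, hidden))
dense_b = np.zeros(hidden)
output_bias = np.zeros(vocab_size)

# two masked positions per example
positions = np.array([[1, 4], [0, 2]])

# gather the vectors at the masked positions -> [batch * n_masked, hidden]
masked = sequence_output[np.arange(batch_size)[:, None], positions]
masked = masked.reshape(-1, hidden)

# the same gather written with flat offsets into the flattened batch
flat_positions = (positions + np.arange(batch_size)[:, None] * seq_len).reshape(-1)
masked_flat = sequence_output.reshape(-1, hidden)[flat_positions]

# dense transform (tanh is a stand-in), projection onto the vocabulary, softmax
transformed = np.tanh(masked @ dense_w + dense_b)
logits = transformed @ embeddings.T + output_bias
probs = softmax(logits)

print(probs.shape)  # (4, 11)
```

Each row of `probs` is a distribution over the toy vocabulary for one masked position, with the rows ordered example by example.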

In [None]:
def gather_indexes(sequence_tensor, positions):
    """Gathers the vectors at the specific positions over a minibatch."""
    sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
    batch_size = sequence_shape[0]
    seq_length = sequence_shape[1]
    width = sequence_shape[2]

    flat_offsets = tf.reshape(
      tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
    flat_positions = tf.reshape(positions + flat_offsets, [-1])
    flat_sequence_tensor = tf.reshape(sequence_tensor,
                                    [batch_size * seq_length, width])
    output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
    return output_tensor

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions):
    """Get probabilities for the masked LM.
    
    bert_config - instance of BertConfig
    input_tensor - output of bert_model.get_sequence_output()
    output_weights - projection matrix, here we use embeddings matrix and then transpose it
    positions - positions of MASKED tokens, i.e. at which positions we want to make predictions
    """
    input_tensor = gather_indexes(input_tensor, positions)

    with tf.variable_scope("cls/predictions"):
        # We apply one more non-linear transformation before the output layer.
        with tf.variable_scope("transform"):
            input_tensor = tf.layers.dense(
              input_tensor,
              units=bert_config.hidden_size,
              activation=modeling.get_activation(bert_config.hidden_act),
              kernel_initializer=modeling.create_initializer(
                  bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        output_bias = tf.get_variable(
            "output_bias",
            shape=[bert_config.vocab_size],
            initializer=tf.zeros_initializer())
        logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probs = tf.nn.softmax(logits, axis=-1)

    return probs

In [None]:
# define placeholder for MASKED tokens positions
masked_lm_positions_ph = tf.placeholder(shape=(None, None), dtype=tf.int32)

# define predictions for MASKED tokens 
masked_lm_probs = get_masked_lm_output(bert_config, 
                                       bert_model.get_sequence_output(),
                                       bert_model.get_embedding_table(),
                                       masked_lm_positions_ph)

In [None]:
# here we have a tensor of shape [batch_size * masked_tokens_n, vocabulary_size] with probabilities
masked_lm_probs

#### 4. Initialize BERT model with pre-trained weights
We have already defined TensorFlow graph for `bert_model`. Next step is to load weights from pre-trained checkpoint.

In [None]:
# define TensorFlow session
sess_config = tf.ConfigProto(allow_soft_placement=True)
sess_config.gpu_options.allow_growth = True
sess = tf.Session(config=sess_config)

init_checkpoint = BERT_MODEL_PATH + 'bert_model.ckpt'

# load from checkpoint
tvars = tf.trainable_variables()
assignment_map, initialized_variable_names = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

sess.run(tf.global_variables_initializer())

`bert_model` is loaded. Let's check its outputs.

In [None]:
bert_model_output = sess.run(bert_model.get_sequence_output(), feed_dict={
    input_ids_ph: [input_example.input_ids],
    input_masks_ph: [input_example.input_mask],
    token_types_ph: [input_example.input_type_ids],
})
print('bert_model sequence output shape:', bert_model_output.shape)
print('bert_model sequence output:', bert_model_output)

#### 5. Masked Language Modeling with BERT
The BERT model was trained on the Masked Language Modeling task, which is the task of predicting a word from its context:
```
Bob is a [MASK] man.
```
Masked Language Models answer the question: which token is hidden behind the `[MASK]` token?


In this part of the tutorial we will use BERT to answer such questions.

We will start with preprocessing the input text: we need to put `[MASK]` tokens somewhere in it. To do this we need to know the id of the `[MASK]` token.

In [None]:
from bert_dp import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file=BERT_MODEL_PATH + 'vocab.txt',
    do_lower_case=True,
)

MASK_TOKEN = '[MASK]'
MASK_ID = tokenizer.convert_tokens_to_ids([MASK_TOKEN])[0]
MASK_ID

Next we define a function that replaces some tokens in `input_example` with `[MASK]`. The `put_mask_tokens` function has to modify `input_example.tokens` and `input_example.input_ids`.
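
If you get stuck, here is one possible approach, shown on a stand-in object rather than on a real `BertPreprocessor` result. `put_mask_tokens_sketch` and the `SimpleNamespace` example are hypothetical and exist only for this illustration; the id 103 for `[MASK]` is the one used by the uncased BERT-base vocabulary in this notebook.

```python
from copy import deepcopy
from types import SimpleNamespace

MASK_TOKEN = '[MASK]'
MASK_ID = 103  # id of [MASK] in the uncased BERT-base vocabulary


def put_mask_tokens_sketch(input_example, positions):
    """Replace tokens and ids at `positions` with [MASK] on a copy of the example."""
    input_example = deepcopy(input_example)
    for pos in positions:
        input_example.tokens[pos] = MASK_TOKEN
        input_example.input_ids[pos] = MASK_ID
    masked_lm_positions = [i for i, t in enumerate(input_example.tokens)
                           if t == MASK_TOKEN]
    return input_example, masked_lm_positions


# stand-in for a preprocessed example (only the fields used here)
example = SimpleNamespace(
    tokens=['[CLS]', 'bob', 'is', 'a', 'good', 'man', '.', '[SEP]'],
    input_ids=[101, 3960, 2003, 1037, 2204, 2158, 1012, 102, 0, 0],
)

masked, masked_positions = put_mask_tokens_sketch(example, positions=[4])
print(masked.tokens)
print(masked.input_ids)
```

Working on a deep copy keeps the original example intact, so it can still be printed next to the masked version.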

In [None]:
from copy import deepcopy
import numpy as np


def put_mask_tokens(input_example, positions):
    """
    Puts `[MASK]` tokens at each position in `positions` list.
    Updates values of input_example's tokens and input_ids.
    Returns updated input_example and masked_lm_positions
    
    input_example - result of BertPreprocessor with tokens, input_ids, and so on.
    positions - list of subtokens positions to change to `[MASK]`
    """
    input_example = deepcopy(input_example)
    #### YOUR CODE HERE START ####
    
    #### YOUR CODE HERE END ####
    masked_lm_positions = [i for i in range(len(input_example.tokens)) if input_example.tokens[i] == MASK_TOKEN]
    return input_example, masked_lm_positions

In [None]:
# check your implementation of `put_mask_tokens`
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=16)
input_example = bp(texts_a = ['Bob is a good man.'], texts_b = ['He has three kids.'])[0]
input_example_masked, masked_lm_positions = put_mask_tokens(input_example, positions=[4, 8, 10])

In [None]:
print('Testing put_mask_tokens')
assert(input_example_masked.tokens == ['[CLS]', 'bob', 'is', 'a', '[MASK]', 'man', '.', '[SEP]', '[MASK]', 'has', '[MASK]', 'kids', '.', '[SEP]'])
assert(input_example_masked.input_ids == [101, 3960, 2003, 1037, 103, 2158, 1012, 102, 103, 2038, 103, 4268, 1012, 102, 0, 0])
print('Test passed')

Now that we have an `input_example` with masked tokens, we are ready to predict the tokens hidden behind `[MASK]`. `masked_lm_probs` returns a probability distribution, so we can use `argmax` to get the most probable token id.

In [None]:
probs = sess.run(masked_lm_probs, feed_dict={
    input_ids_ph: [input_example_masked.input_ids],
    input_masks_ph: [input_example_masked.input_mask],
    token_types_ph: [input_example_masked.input_type_ids],
    masked_lm_positions_ph: [masked_lm_positions],
})

print('input       :', input_example.tokens)
print('masked input:', input_example_masked.tokens)
for i, p in enumerate(probs):
    print(f'prediction for {i}th MASK token:', tokenizer.convert_ids_to_tokens([np.argmax(p)]))

Try making predictions with different `texts_a`, `texts_b` and `positions`. How well does it work?



In [None]:
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=16)

# change input example and/or masked positions
input_example = bp(texts_a = ['Bob is a good man.'], texts_b = ['He has three kids.'])[0]
input_example_masked, masked_lm_positions = put_mask_tokens(input_example, positions=[4, 8, 10])

probs = sess.run(masked_lm_probs, feed_dict={
    input_ids_ph: [input_example_masked.input_ids],
    input_masks_ph: [input_example_masked.input_mask],
    token_types_ph: [input_example_masked.input_type_ids],
    masked_lm_positions_ph: [masked_lm_positions],
})

print('input       :', input_example.tokens)
print('masked input:', input_example_masked.tokens)
for i, p in enumerate(probs):
    print(f'prediction for {i}th MASK token:', tokenizer.convert_ids_to_tokens([np.argmax(p)]))

Predicting masked tokens is useful on its own, for example for replacing tokens with contextually similar ones for paraphrasing or data augmentation.

But how can we use these `[MASK]` tokens to make BERT generate a continuation of the phrase "Bob is a good man. He has..."?

In [None]:
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=16)
input_example = bp(texts_a = ['Bob is a good man.'], texts_b = ['He has three kids.'])[0]

#### YOUR CODE HERE START ####
# put [MASK] for every token after `He has`
# positions = 
#### YOUR CODE HERE END ####

input_example_masked, masked_lm_positions = put_mask_tokens(input_example, positions=positions)

print('Test positions')
assert(input_example_masked.tokens == ['[CLS]', 'bob', 'is', 'a', 'good', 'man', '.', '[SEP]', 'he', 'has', '[MASK]', '[MASK]', '[MASK]', '[SEP]'])
print('Test passed')

probs = sess.run(masked_lm_probs, feed_dict={
    input_ids_ph: [input_example_masked.input_ids],
    input_masks_ph: [input_example_masked.input_mask],
    token_types_ph: [input_example_masked.input_type_ids],
    masked_lm_positions_ph: [masked_lm_positions],
})

print('input       :', input_example.tokens)
print('masked input:', input_example_masked.tokens)
for i, p in enumerate(probs):
    print(f'prediction for {i}th MASK token:', tokenizer.convert_ids_to_tokens([np.argmax(p)]))

#### 6. Text generation with BERT
In the previous example BERT predicted all three masked tokens independently. This is not the best behavior for a text generation model. Consider the example
```
The weather in [MASK] [MASK] is hot.
```

A model that predicts tokens independently can output `New York`, `San York`, `New Francisco` or `San Francisco`, even though some of these cities do not exist. If the model generates tokens sequentially, the prediction of the second token is conditioned on the first one (`New` or `San`), which makes `San York` and `New Francisco` very unlikely.
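
The effect can be worked out on a toy joint distribution. The numbers below are made up for illustration: all probability mass sits on the two real cities, yet independent prediction from the per-position marginals still gives a quarter of the mass to each invalid combination.

```python
import numpy as np

first = ['new', 'san']
second = ['york', 'francisco']

# made-up joint distribution: only real cities have probability mass
joint = np.array([[0.5, 0.0],   # new york, new francisco
                  [0.0, 0.5]])  # san york, san francisco

# independent prediction only sees the marginal of each position
p_first = joint.sum(axis=1)
p_second = joint.sum(axis=0)
independent = np.outer(p_first, p_second)

# sequential prediction: p(first) * p(second | first)
conditional = joint / p_first[:, None]
sequential = p_first[:, None] * conditional

for i, f in enumerate(first):
    for j, s in enumerate(second):
        print(f'{f} {s}: independent={independent[i, j]:.2f} '
              f'sequential={sequential[i, j]:.2f}')
```

With independent prediction `san york` gets probability 0.25, while the sequential factorization recovers the joint distribution and assigns it 0.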

The same motivation appears in the XLNet paper (https://arxiv.org/abs/1906.08237), a recent work that criticizes the BERT training scheme and proposes sequential prediction of masked tokens. As a result, the authors report that XLNet outperforms BERT on a wide range of NLP tasks.


Let's sequentially generate text with pre-trained BERT model.


First, we need a function that appends generated tokens (or `[MASK]` tokens) to the end of `input_example`. We will use this function during text generation and to create the initial `input_example` with `[MASK]` tokens.
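
One way to think about this function is as the same list operation applied to every feature: insert the new values just before the final `[SEP]` and trim to the maximum length. The sketch below shows this on a stand-in object; `insert_before_last_sep` and `append_tokens_sketch` are hypothetical helpers, not part of DeepPavlov, and the sketch does not handle the case where trimming would cut off real tokens.

```python
from copy import deepcopy
from types import SimpleNamespace


def insert_before_last_sep(values, input_len, new_values, max_seq_len):
    """Insert new_values before index input_len - 1 (the final [SEP]), then trim."""
    merged = values[:input_len - 1] + new_values + values[input_len - 1:]
    return merged[:max_seq_len]


def append_tokens_sketch(input_example, token='[MASK]', token_id=103, n=3):
    input_example = deepcopy(input_example)
    max_seq_len = len(input_example.input_mask)
    input_len = sum(input_example.input_mask)  # number of real (non-padding) tokens
    input_example.tokens = insert_before_last_sep(
        input_example.tokens, input_len, [token] * n, max_seq_len)
    input_example.input_ids = insert_before_last_sep(
        input_example.input_ids, input_len, [token_id] * n, max_seq_len)
    input_example.input_mask = insert_before_last_sep(
        input_example.input_mask, input_len, [1] * n, max_seq_len)
    input_example.input_type_ids = insert_before_last_sep(
        input_example.input_type_ids, input_len, [1] * n, max_seq_len)
    return input_example


# stand-in for the preprocessed pair ('Bob is a good man.', 'He has')
example = SimpleNamespace(
    tokens=['[CLS]', 'bob', 'is', 'a', 'good', 'man', '.', '[SEP]', 'he', 'has', '[SEP]'],
    input_ids=[101, 3960, 2003, 1037, 2204, 2158, 1012, 102, 2002, 2038, 102,
               0, 0, 0, 0, 0],
    input_mask=[1] * 11 + [0] * 5,
    input_type_ids=[0] * 8 + [1] * 3 + [0] * 5,
)

appended = append_tokens_sketch(example, n=3)
print(appended.tokens)
print(appended.input_ids)
```

Because the padded lists grow by `n` elements before trimming, cutting them back to `max_seq_len` simply drops `n` padding zeros as long as there is enough padding left.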

In [None]:
def append_tokens(input_example, token=MASK_TOKEN, token_id=MASK_ID, n=3):
    """
    This function appends `token` to `input_example` `n` times.
    Also, it maintains correct values for `input_mask`, `input_ids`, `input_type_ids`.
    Don't forget that [SEP] token is always the last token.
    
    input_example - result of BertPreprocessor with tokens, input_ids, ...
    token - token to append
    token_id - token id to append
    n - how many times to append token to input_example
    """
    input_example = deepcopy(input_example)
    max_seq_len = len(input_example.input_mask)
    input_len = sum(input_example.input_mask)
    
    # here we insert token n times just before the last [SEP] token
    new_tokens = (input_example.tokens[:input_len - 1] + [token] * n + input_example.tokens[input_len-1:])[:max_seq_len]
    input_example.tokens = new_tokens
    assert len(new_tokens) <= max_seq_len
    
    # here you should insert mask values
    # new_input_mask = YOUR CODE HERE
    input_example.input_mask = new_input_mask
    assert len(new_input_mask) <= max_seq_len
    
    # here you should insert token id
    # new_input_ids = YOUR CODE HERE
    input_example.input_ids = new_input_ids
    assert len(new_input_ids) <= max_seq_len
    
    # token_type_id is 1 for the second sentence
    # (this one is already implemented; use it as a reference for the two blanks above)
    new_input_type_ids = (input_example.input_type_ids[:input_len - 1] + [1] * n + input_example.input_type_ids[input_len-1:])[:max_seq_len]
    input_example.input_type_ids = new_input_type_ids
    assert len(new_input_type_ids) <= max_seq_len
    
    return input_example, [i for i in range(len(input_example.tokens)) if input_example.tokens[i] == MASK_TOKEN]

In [None]:
# check your implementation of `append_tokens` function
max_seq_len = 16
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=max_seq_len)
input_example = bp(texts_a = ['Bob is a good man.'],
                   texts_b = ['He has'])[0]
appended_example, _ = append_tokens(input_example, n=3)

In [None]:
print('Testing append_tokens')
assert(appended_example.tokens == ['[CLS]', 'bob', 'is', 'a', 'good', 'man', '.', '[SEP]', 'he', 'has', '[MASK]', '[MASK]', '[MASK]', '[SEP]'])
assert(appended_example.input_ids == [101, 3960, 2003, 1037, 2204, 2158, 1012, 102, 2002, 2038, 103, 103, 103, 102, 0, 0])
assert(appended_example.input_mask == [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
assert(appended_example.input_type_ids == [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0])
print('Test passed')

The `generate_text` function performs sequential text generation with the pre-trained BERT. Check its docstring for details.

In [None]:
def generate_text(input_example, sampling_method='greedy', mask_tokens_n=3, max_generated_tokens=15):
    """
    This function generates text using input_example as initial text.
    
    Text generation stops when one of the ['.', '?', '!'] symbols is predicted or
    when `max_generated_tokens` tokens have been generated.
    """
    generated_example = deepcopy(input_example)
    for i in range(max_generated_tokens):
        # Firstly, we append [MASK] tokens to the end of a text.
        # If mask_tokens_n is too small (e.g., 1) then model will predict "." and generation will stop.
        # It happens because BERT learned that the last token in sentences is usually ".".
        masked_input_example, masked_lm_positions = append_tokens(generated_example, n=mask_tokens_n)
        
        # get distribution over vocabulary for the first masked token
        probs = sess.run(masked_lm_probs, feed_dict={
            input_ids_ph: [masked_input_example.input_ids],
            input_masks_ph: [masked_input_example.input_mask],
            token_types_ph: [masked_input_example.input_type_ids],
            masked_lm_positions_ph: [masked_lm_positions],
        })[0]
        
        # sample token from vocabulary using probs
        if sampling_method == 'greedy':
            next_token_id = np.argmax(probs)
        else:
            next_token_id = sampling_method(probs)
        
        # append generated token to text
        next_token = tokenizer.convert_ids_to_tokens([next_token_id])[0]    
        generated_example, _ = append_tokens(generated_example, token=next_token, token_id=next_token_id, n=1)
        
        if generated_example.tokens[-2] in ['.', '?', '!']:
            break

    return generated_example

Let's generate a continuation for "Bob is a good man. He has...". Note that the generated text differs from the text previously generated with independent `[MASK]` token predictions.

In [None]:
max_seq_len = 32
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=max_seq_len)
input_example = bp(texts_a = ['Bob is a good man.'],
                   texts_b = ['He has'])[0]

In [None]:
n_samples = 5
print('greedy')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method='greedy')
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))

Different techniques can be used to choose the next token during text generation.

The simplest one is the `greedy` approach: at each decoding step we choose the token with the highest probability in the vocabulary.

The `greedy` approach makes text generation deterministic. What if we want to generate different possible outputs? Let's sample from the distribution over the vocabulary!


In [None]:
def random_sampling(probs):
    """
    Sample from full distribution over vocabulary.
    """
    # renormalize; the extra 1e-06 keeps the sum at or below 1 despite floating-point rounding
    probs = probs / (np.sum(probs) + 1e-06)
    return np.argmax(np.random.multinomial(n=1, pvals=probs))

In [None]:
print('random')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=random_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))

There are two common modifications to plain sampling from the distribution over the vocabulary: `top_k_sampling` and `top_p_sampling`. Both truncate the tail of the probability distribution.
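
To see what "truncating the tail" means, here is a small NumPy sketch of both filters on a five-token toy distribution. It is one possible way to do it, separate from the exercises in this notebook; `top_k_filter`, `top_p_filter` and `sample` are hypothetical helper names.

```python
import numpy as np


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ids = np.argsort(probs)[::-1][:k]
    kept = probs[ids]
    return ids, kept / kept.sum()


def top_p_filter(probs, p):
    """Keep the shortest sorted prefix whose cumulative probability reaches p."""
    ids = np.argsort(probs)[::-1]
    sorted_probs = probs[ids]
    cumulative = np.cumsum(sorted_probs)
    # first index where the cumulative sum reaches p, plus one to include it
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = sorted_probs[:cutoff]
    return ids[:cutoff], kept / kept.sum()


def sample(ids, probs, rng):
    """Draw one token id from the truncated, renormalized distribution."""
    return int(rng.choice(ids, p=probs))


probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
rng = np.random.default_rng(0)

ids_k, probs_k = top_k_filter(probs, k=2)
ids_p, probs_p = top_p_filter(probs, p=0.8)
print('top k keeps ids:', ids_k, 'with probs', probs_k)
print('top p keeps ids:', ids_p, 'with probs', probs_p)
print('sampled id:', sample(ids_p, probs_p, rng))
```

With `k=2` only the two most probable tokens survive, while `p=0.8` keeps three tokens because the first two cover only 0.7 of the mass. The number of tokens kept by top-p therefore adapts to how peaked the distribution is.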

In [None]:
def top_k_sampling(probs, k=10):
    """
    Sample from k tokens with the highest probabilities.
    Don't forget to make top k probabilities sum to 1.
    """
    # get top k indices from probs using np.argsort
    top_k_tokens_ids = # your code
    # get top k probabilities from probs using top_k_tokens_ids
    top_k_probs = # your code
    # make sure that sum of top_k_probs == 1, renormalize it
    top_k_probs = # your code
    return top_k_tokens_ids[np.argmax(np.random.multinomial(n=1, pvals=top_k_probs))]

In [None]:
k = 10
top_k_10_sampling = lambda x: top_k_sampling(x, k)
print(f'top k, k={k}')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=top_k_10_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))

In [None]:
def top_p_sampling(probs, p=0.9):
    """
    Sample from top tokens with a cumulative probability just above `p`.
    Don't forget to make selected top probabilities sum to 1.
    """
    
    # get indices sorted by probs using np.argsort in descending order
    sorted_ids = # your code
    # probabilities from probs in descending order
    sorted_probs = # your code
    
    # probabilities such that sum of them is just above `p`:
    # sum(sorted_probs[:j]) > p
    # sum(sorted_probs[:j-1]) < p
    # top_p_probs = sorted_probs[:j]
    # consider the case when sorted_probs[0] > p and define top_p_probs = [sorted_probs[0]]
    top_p_probs = # your code
    # make sure that sum of top_p_probs == 1, renormalize it
    top_p_probs = top_p_probs / sum(top_p_probs)
    return sorted_ids[np.argmax(np.random.multinomial(n=1, pvals=top_p_probs))]

In [None]:
p = 0.9
top_p_09_sampling = lambda x: top_p_sampling(x, p)
print(f'top p, p={p}')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=top_p_09_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))


An alternative to these sampling methods is `beam search`. Its main idea is to maintain a fixed number of the most probable hypotheses (beams) at each step and to select one of them at the end.
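
The idea can be sketched on a toy next-token table. This is a simplified illustration, not a BERT decoder: `TOY_MODEL`, `toy_next_token_probs` and the `beam_search` signature are all hypothetical, and the probabilities are made up.

```python
import math


def beam_search(next_token_probs, start, beam_size=2, max_len=3, eos='.'):
    """Keep the `beam_size` most probable hypotheses at every step."""
    beams = [(0.0, [start])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                # finished hypotheses are carried over unchanged
                candidates.append((score, tokens))
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams


# made-up next-token probabilities that depend only on the last token
TOY_MODEL = {
    'he': {'has': 0.6, 'is': 0.4},
    'has': {'kids': 0.55, 'a': 0.45},
    'is': {'kind': 0.9, 'a': 0.1},
    'kids': {'.': 1.0},
    'a': {'.': 1.0},
    'kind': {'.': 1.0},
}


def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]


best = beam_search(toy_next_token_probs, start='he', beam_size=2, max_len=3)
for score, tokens in best:
    print(' '.join(tokens), round(math.exp(score), 2))
```

In this toy table greedy decoding would start with `has` (0.6) and end with a sequence of probability 0.33, while a beam of size 2 also keeps `is` alive and finds `he is kind .` with probability 0.36.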

In the paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751) the authors compared different decoding methods with human-written text. They showed that text generated with `beam search` is less varied and less surprising than human-written text, while sampling techniques such as `top_k` and `top_p` produce text that is closer to human-written text than `beam search` output.

<img src="img/beam_search_vs_human.png" width=50% align="left">

<img src="img/decoding.png" width=50% align="right">

Try different inputs, e.g.:
```
input_example = bp(texts_a = ['What is love? Baby don\'t hurt me'],
                   texts_b = ['Don\'t hurt me'])[0]

input_example = bp(texts_a = ['- That was a good day, isn\'t it?'],
                   texts_b = ['- '])[0]
```

The last example makes BERT behave like a zero-shot chit-chat model!

In [None]:
max_seq_len = 32
bp = BertPreprocessor(vocab_file=BERT_MODEL_PATH + 'vocab.txt', do_lower_case=True, max_seq_length=max_seq_len)

input_example = # YOUR CODE

n_samples = 5
print('greedy')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method='greedy')
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))
    
print('random')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=random_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))
    
k = 10
top_k_10_sampling = lambda x: top_k_sampling(x, k)
print(f'top k, k={k}')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=top_k_10_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))
    
p = 0.9
top_p_09_sampling = lambda x: top_p_sampling(x, p)
print(f'top p, p={p}')
for j in range(n_samples):
    generated_example = generate_text(input_example, sampling_method=top_p_09_sampling)
    print(' '.join(generated_example.tokens[1:-1]).replace(' ##', '').replace('##', ''))

#### 7. Next steps

So far we have used only the pre-trained BERT model for text generation. It was not trained for this task, yet it worked quite well.

What if we train the BERT model on a sequence generation task such as chit-chat, using the [OpenSubtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) or [PersonaChat](https://arxiv.org/abs/1801.07243) dataset?

What changes should be made to the BERT model to turn it into a generative sequence model? How can we introduce causality into the BERT model?
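
As a hint for the last question, one common way to make attention causal in left-to-right language models is a lower-triangular attention mask, so that each position attends only to itself and to earlier positions. The NumPy sketch below illustrates the mask on dummy attention scores; it is an assumption about one possible direction, not something this notebook implements, and `causal_mask` and `masked_softmax` are hypothetical names.

```python
import numpy as np


def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int32))


def masked_softmax(scores, mask):
    """Softmax over the last axis that ignores positions where mask == 0."""
    scores = np.where(mask == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_mask(4)
scores = np.zeros((4, 4))  # dummy attention scores, all equal
weights = masked_softmax(scores, mask)

print(mask)
print(weights.round(2))
```

With equal scores, the first position puts all of its attention on itself and the last one spreads it evenly over all four positions, while the weights on future positions are zero.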


### Resources and additional materials
* Google's BERT [repo](https://github.com/google-research/bert) and [paper](https://arxiv.org/abs/1810.04805)
* A Transformer Chatbot Tutorial with TensorFlow 2.0 on [Medium](https://medium.com/tensorflow/a-transformer-chatbot-tutorial-with-tensorflow-2-0-88bf59e66fe2)
* Hugging Face's [blogpost](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313#7b60) on How to build a State-of-the-Art Conversational AI with Transfer Learning and their [Demo](https://convai.huggingface.co/)
* XLNet paper: https://arxiv.org/abs/1906.08237
* The Curious Case of Neural Text Degeneration paper: https://arxiv.org/abs/1904.09751