<a href="https://colab.research.google.com/github/sushil79g/60daysUdacity/blob/master/Conversational_ai%20/transfer_convai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook is optionally accelerated with a GPU runtime.
### If you would like to use this acceleration, please select the menu option "Runtime" -> "Change runtime type", select "Hardware Accelerator" -> "GPU" and click "SAVE"

----------------------------------------------------------------------

# BERT

*Author: HuggingFace Team*

**Bidirectional Encoder Representations from Transformers.**

_ | _
- | -
![alt](https://pytorch.org/assets/images/bert1.png) | ![alt](https://pytorch.org/assets/images/bert2.png)


### Model Description

BERT was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin et al. The model is based on the Transformer architecture introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani et al and has led to significant improvements on a wide range of downstream tasks.

Eight models based on BERT, built on [Google's pre-trained models](https://github.com/google-research/bert), are available along with the associated tokenizer:
- `bertTokenizer`: perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization
- `bertModel`: raw BERT Transformer model (fully pre-trained)
- `bertForMaskedLM`: BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained)
- `bertForNextSentencePrediction`: BERT Transformer with the pre-trained next sentence prediction classifier on top (fully pre-trained)
- `bertForPreTraining`: BERT Transformer with masked language modeling head and next sentence prediction classifier on top (fully pre-trained)
- `bertForSequenceClassification`: BERT Transformer with a sequence classification head on top (BERT Transformer is pre-trained, the sequence classification head is only initialized and has to be trained)
- `bertForMultipleChoice`: BERT Transformer with a multiple choice head on top (used for tasks such as SWAG) (BERT Transformer is pre-trained, the multiple choice classification head is only initialized and has to be trained)
- `bertForTokenClassification`: BERT Transformer with a token classification head on top (BERT Transformer is pre-trained, the token classification head is only initialized and has to be trained)
- `bertForQuestionAnswering`: BERT Transformer with a span classification head on top that produces start and end position logits for the answer (BERT Transformer is pre-trained, the question answering head is only initialized and has to be trained)
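
To make the "WordPiece tokenization" step mentioned for `bertTokenizer` more concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting. It is not the library's implementation: `wordpiece_tokenize`, `toy_tokenize` and `toy_vocab` are hypothetical names, and the tiny vocabulary is invented for illustration only.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Whitespace split followed by WordPiece splitting of each word."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary (assumption, not the real BERT vocabulary).
toy_vocab = {"Jim", "Hen", "##son", "was", "a", "puppet", "##eer"}

print(toy_tokenize("Jim Henson was a puppeteer", toy_vocab))
```

Words missing from the vocabulary are broken into the longest known prefix followed by `##`-prefixed continuation pieces, which is why a sentence can yield more tokens than it has words.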

### Requirements

Unlike most other PyTorch Hub models, BERT requires a few additional Python packages to be installed.

In [0]:
%%bash
pip install tqdm boto3 requests regex

Collecting regex
  Downloading https://files.pythonhosted.org/packages/6f/4e/1b178c38c9a1a184288f72065a65ca01f3154df43c6ad898624149b8b4e0/regex-2019.06.08.tar.gz (651kB)
Building wheels for collected packages: regex
  Building wheel for regex (setup.py): started
  Building wheel for regex (setup.py): finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/35/e4/80/abf3b33ba89cf65cd262af8a22a5a999cc28fbfabea6b38473
Successfully built regex
Installing collected packages: regex
Successfully installed regex-2019.6.8


### Example

Here is an example of how to tokenize the input text with `bertTokenizer`, and then either get the hidden states computed by `bertModel` or predict masked tokens using `bertForMaskedLM`. The example also includes snippets showing how to use `bertForNextSentencePrediction`, `bertForQuestionAnswering`, `bertForSequenceClassification`, `bertForMultipleChoice`, `bertForTokenClassification`, and `bertForPreTraining`.
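
The cells in this example use a hand-written list of segment ids (0 for sentence A, 1 for sentence B). As a rough sketch of how such a list could be derived rather than typed by hand, the helper below switches segment after each `[SEP]` token. `segment_ids_from_tokens` is a hypothetical helper, and the token list is an assumed tokenization chosen for illustration; the real tokenizer output may differ.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign 0 to tokens up to and including the first separator, 1 to the next sentence, and so on."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            current += 1  # tokens after a separator belong to the next sentence
    return segment_ids


# Assumed tokenization, for illustration only.
tokens = ["[CLS]", "Who", "was", "Jim", "Hen", "##son", "?", "[SEP]",
          "Jim", "Hen", "##son", "was", "a", "puppet", "##eer", "[SEP]"]

segment_ids = segment_ids_from_tokens(tokens)
print(segment_ids)
```

Deriving the ids from the tokens keeps the two lists the same length, which matters because the model receives them as tensors of matching shape.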

In [0]:
### First, tokenize the input
import torch
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False)

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

Downloading: "https://github.com/huggingface/pytorch-pretrained-BERT/archive/master.zip" to /root/.cache/torch/hub/master.zip
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
100%|██████████| 213450/213450 [00:00<00:00, 2415931.60B/s]


In [0]:
### Get the hidden states computed by `bertModel`
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertModel', 'bert-base-cased')
model.eval()

with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master
100%|██████████| 313/313 [00:00<00:00, 117017.31B/s]
100%|██████████| 435779157/435779157 [00:12<00:00, 35696160.30B/s]


In [0]:
### Predict masked tokens using `bertForMaskedLM`
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

maskedLM_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForMaskedLM', 'bert-base-cased')
maskedLM_model.eval()

with torch.no_grad():
    predictions = maskedLM_model(tokens_tensor, segments_tensors)

# Get the predicted token
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'Jim'

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


TypeError: ignored

In [0]:
### Classify next sentence using ``bertForNextSentencePrediction``
# Going back to our initial input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

nextSent_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForNextSentencePrediction', 'bert-base-cased')
nextSent_model.eval()

# Predict the next sentence classification logits
with torch.no_grad():
    next_sent_classif_logits = nextSent_model(tokens_tensor, segments_tensors)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


In [0]:
### Question answering using `bertForQuestionAnswering`
questionAnswering_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForQuestionAnswering', 'bert-base-cased')
questionAnswering_model.eval()

# Predict the start and end positions logits
with torch.no_grad():
    start_logits, end_logits = questionAnswering_model(tokens_tensor, segments_tensors)

# Or get the total loss which is the sum of the CrossEntropy loss for the start and end token positions (set model to train mode before if used for training)
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
question_answering_loss = questionAnswering_model(tokens_tensor, segments_tensors, start_positions=start_positions, end_positions=end_positions)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


In [0]:
### Classify sequence using `bertForSequenceClassification`
# Load bertForSequenceClassification
seqClassification_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForSequenceClassification', 'bert-base-cased', num_labels=2)
seqClassification_model.eval()

# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = seqClassification_model(tokens_tensor, segments_tensors)

# Or get the sequence classification loss (set model to train mode before if used for training)
labels = torch.tensor([1])
seq_classif_loss = seqClassification_model(tokens_tensor, segments_tensors, labels=labels)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


In [0]:
### Sequence tagging using `bertForTokenClassification`
tokClassification_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForTokenClassification', 'bert-base-cased', num_labels=2)
tokClassification_model.eval()
# Predict the token classification logits
with torch.no_grad():
    classif_logits = tokClassification_model(tokens_tensor, segments_tensors)

# Or get the token classification loss (set model to train mode before if used for training)
labels = torch.tensor([[0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]])
classif_loss = tokClassification_model(tokens_tensor, segments_tensors, labels=labels)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


In [0]:
### Select answer among multiple choice using `bertForMultipleChoice`
multiplChoice_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForMultipleChoice', 'bert-base-cased', num_choices=2)
multiplChoice_model.eval()

tokens_tensor = torch.tensor([[indexed_tokens, indexed_tokens]])
segments_tensors = torch.tensor([[segments_ids, segments_ids]])

# Predict the multiple choice logits
with torch.no_grad():
    multiple_choice_logits = multiplChoice_model(tokens_tensor, segments_tensors)

# Or get the multiple choice loss (set model to train mode before if used for training)
labels = torch.tensor([1])
multiple_choice_loss = multiplChoice_model(tokens_tensor, segments_tensors, labels=labels)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


TypeError: ignored

In [0]:
### Fine-tune BERT using `bertForPreTraining`
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

forPretraining_model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertForPreTraining', 'bert-base-cased')
masked_lm_logits_scores, seq_relationship_logits = forPretraining_model(tokens_tensor, segments_tensors)

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-pretrained-BERT_master


### Resources

 - Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
 - Initial repository (with detailed examples and documentation): [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)

In [0]:
from pytorch_pretrained_bert import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer

In [0]:
# !pip install pytorch_pretrained_bert
model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

100%|██████████| 478750579/478750579 [00:09<00:00, 52993865.18B/s]
100%|██████████| 273/273 [00:00<00:00, 47528.02B/s]
100%|██████████| 815973/815973 [00:00<00:00, 3869230.47B/s]
100%|██████████| 458495/458495 [00:00<00:00, 3420982.02B/s]
ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


In [0]:
SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]

In [0]:
tokenizer.set_special_tokens(SPECIAL_TOKENS)
model.set_num_special_tokens(len(SPECIAL_TOKENS))
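
As a rough illustration of what registering special tokens amounts to, the sketch below assigns each special token an id directly after the base vocabulary. The base vocabulary size of 40478 is an assumption inferred from the id printed for `<bos>` later in this notebook; `special_token_ids` and `total_vocab_size` are hypothetical names, not library attributes.

```python
SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]

# Assumption: size of the base vocabulary, inferred from the ids printed in this notebook.
BASE_VOCAB_SIZE = 40478

# Each special token gets the next free id after the base vocabulary.
special_token_ids = {
    token: BASE_VOCAB_SIZE + offset for offset, token in enumerate(SPECIAL_TOKENS)
}

# The embedding matrix has to grow by the same number of rows.
total_vocab_size = BASE_VOCAB_SIZE + len(SPECIAL_TOKENS)

for token, token_id in special_token_ids.items():
    print(token, token_id)
print("total vocabulary size:", total_vocab_size)
```

This is why the tokenizer and the model are updated together: the tokenizer needs the new ids, and the model needs embedding rows for them.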

In [0]:
from itertools import chain
persona = [["i","like","playing","football","."],
          ["i","am","from","NYC","."]]
history = [["hello","how","are","you","?"],
          ["i","am","fine","thanks","."]]
reply = ["great","to","hear"]

In [0]:
bos, eos, speaker1, speaker2 = "<bos>","<eos>","<speaker1>","<speaker2>"
def build_inputs(persona, history, reply):
    sequence = [[bos] + list(chain(*persona))] + history + [reply + [eos]]
    # Prefix each utterance with an alternating speaker token
    sequence = [sequence[0]] + [[speaker2 if (len(sequence) - i) % 2 else speaker1] + s for i, s in enumerate(sequence[1:])]
    words = list(chain(*sequence))
    segments = [speaker2 if i % 2 else speaker1
                for i,s in enumerate(sequence) for _ in s]
    position = list(range(len(words)))
    return words, segments, position, sequence

In [0]:
words, segments, position, sequence = build_inputs(persona, history, reply)

In [0]:
print(words)

['<bos>', 'i', 'like', 'playing', 'football', '.', 'i', 'am', 'from', 'NYC', '.', '<speaker1>', 'hello', 'how', 'are', 'you', '?', '<speaker2>', 'i', 'am', 'fine', 'thanks', '.', '<speaker1>', 'great', 'to', 'hear', '<eos>']


In [0]:
print(segments)

['<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker1>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker2>', '<speaker2>']


In [0]:
print(position)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]


In [0]:
print(sequence)

[['<bos>', 'i', 'like', 'playing', 'football', '.', 'i', 'am', 'from', 'NYC', '.'], ['<speaker1>', 'hello', 'how', 'are', 'you', '?'], ['<speaker2>', 'i', 'am', 'fine', 'thanks', '.'], ['<speaker1>', 'great', 'to', 'hear', '<eos>']]


In [0]:
words = tokenizer.convert_tokens_to_ids(words)
segments = tokenizer.convert_tokens_to_ids(segments)

In [0]:
print(words)

[40478, 11, 14594, 0, 0, 1, 11, 1574, 0, 0, 1, 40480, 0, 1991, 2183, 7159, 19, 40481, 11, 1574, 0, 12389, 1, 40480, 5201, 571, 863, 40479]


In [0]:
print(segments)

[40480, 40480, 40480, 40480, 40480, 40480, 40480, 40480, 40480, 40480, 40480, 40481, 40481, 40481, 40481, 40481, 40481, 40480, 40480, 40480, 40480, 40480, 40480, 40481, 40481, 40481, 40481, 40481]


In [0]:
import torch

# Let's add a distractor to our previously defined persona, history and reply
distractor = ["sorry", "to", "hear", "that"]

# Build & tokenize inputs ending with our distractor like we did with the gold reply
words_distractor, segments_distractor, _, _ = build_inputs(persona, history, distractor)
words_distractor = tokenizer.convert_tokens_to_ids(words_distractor)
segments_distractor = tokenizer.convert_tokens_to_ids(segments_distractor)

# Prepare our language modeling targets: keep only the reply segment, -1 on the rest
lm_targets = ([-1] * sum(len(s) for s in sequence[:-1])) \
             + [-1] + tokenizer.convert_tokens_to_ids(sequence[-1][1:])
lm_distractor = [-1] * len(words_distractor)

# Store the position of the last tokens for the next-sentence prediction loss
last_token = len(words) - 1
last_token_distractor = len(words_distractor) - 1

# Now we can pad reply and distractor inputs and targets to the same length
padding_length = max(len(words), len(words_distractor))
def pad(x, padding):
    return x + [padding] * (padding_length - len(x))

(words, words_distractor,
 segments, segments_distractor) = [pad(x, tokenizer.convert_tokens_to_ids('<pad>'))
                                   for x in (words, words_distractor,
                                             segments, segments_distractor)]

(lm_targets, lm_distractor) = [pad(x, -1) for x in (lm_targets, lm_distractor)]
 
# And gather reply and distractor inputs to build the input tensors:
# words tokens
input_ids = torch.tensor([[words, words_distractor]], dtype=torch.long)
# segment tokens
token_type_ids = torch.tensor([[segments, segments_distractor]], dtype=torch.long)
# Positions tokens can be automatically created by the model as (0, 1, ..., N)
# Last tokens location
mc_token_ids = torch.tensor([[last_token, last_token_distractor]], dtype=torch.long)
# Language modeling labels
lm_labels = torch.tensor([[lm_targets, lm_distractor]], dtype=torch.long)
# Next-sentence prediction labels
mc_labels = torch.tensor([0], dtype=torch.long)  # Gold reply is 1st (index 0)
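
The language modeling targets above use -1 to mark positions that should not contribute to the loss. The following pure-Python sketch illustrates that idea on toy data. `build_lm_targets` and `masked_cross_entropy` are hypothetical helpers, the ids and logits are invented, and details such as batching or any shifting between inputs and labels inside the model are deliberately left out.

```python
import math

IGNORE_INDEX = -1  # positions labelled -1 are skipped when computing the loss


def build_lm_targets(context_ids, reply_ids, ignore_index=IGNORE_INDEX):
    """Mask the context and the reply's leading speaker token, keep the rest of the reply."""
    return [ignore_index] * (len(context_ids) + 1) + list(reply_ids[1:])


def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose target is not ignored."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        # Log of the softmax normaliser, with max subtraction for numerical stability.
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


# Invented ids: two context tokens, then a reply made of a speaker token and two words.
lm_targets = build_lm_targets(context_ids=[5, 6], reply_ids=[9, 7, 8])
print(lm_targets)

# Uniform toy logits over a vocabulary of 10 entries, one row per position.
toy_logits = [[0.0] * 10 for _ in lm_targets]
loss = masked_cross_entropy(toy_logits, lm_targets)
print(loss)
```

Only the two unmasked reply positions are averaged, so the context and the padding can be of any length without changing the loss.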


In [0]:
lm_loss, mc_loss = model(input_ids, mc_token_ids, lm_labels, mc_labels, token_type_ids)

# Total loss as a weighted sum
lm_coef = 2.0
mc_coef = 1.0
total_loss = lm_loss * lm_coef + mc_loss * mc_coef

In [0]:
# import json
# import logging
# import os
# import tarfile
# import tempfile

# import torch

# from pytorch_pretrained_bert import cached_path

# PERSONACHAT_URL = "https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json"
# HF_FINETUNED_MODEL = "https://s3.amazonaws.com/models.huggingface.co/transfer-learning-chatbot/finetuned_chatbot_gpt.tar.gz"

# # logger = logging.getLogger(__file__)

# def download_pretrained_model():
#     """ Download and extract finetuned model from S3 """
#     resolved_archive_file = cached_path(HF_FINETUNED_MODEL)
#     tempdir = tempfile.mkdtemp()

#     logger.info("extracting archive file {} to temp dir {}".format(resolved_archive_file, tempdir))
#     with tarfile.open(resolved_archive_file, 'r:gz') as archive:
#         archive.extractall(tempdir)
#     return tempdir


# def get_dataset(tokenizer, dataset_path, dataset_cache=None):
#     """ Get PERSONACHAT from S3 """
#     dataset_path = dataset_path or PERSONACHAT_URL
#     dataset_cache = dataset_cache + '_' + type(tokenizer).__name__  # To avoid using GPT cache for GPT-2 and vice-versa
#     if dataset_cache and os.path.isfile(dataset_cache):
#         logger.info("Load tokenized dataset from cache at %s", dataset_cache)
#         dataset = torch.load(dataset_cache)
#     else:
#         logger.info("Download dataset from %s", dataset_path)
#         personachat_file = cached_path(dataset_path)
#         with open(personachat_file, "r", encoding="utf-8") as f:
#             dataset = json.loads(f.read())

#         logger.info("Tokenize and encode the dataset")
#         def tokenize(obj):
#             if isinstance(obj, str):
#                 return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
#             if isinstance(obj, dict):
#                 return dict((n, tokenize(o)) for n, o in obj.items())
#             return list(tokenize(o) for o in obj)
#         dataset = tokenize(dataset)
#         if dataset_cache:
#             torch.save(dataset, dataset_cache)
#     return dataset

# def get_dataset_personalities(tokenizer, dataset_path, dataset_cache=None):
#     """ Get personalities from PERSONACHAT """
#     dataset_path = dataset_path or PERSONACHAT_URL
#     dataset_cache = dataset_cache + '_' + type(tokenizer).__name__  # To avoid using GPT cache for GPT-2 and vice-versa
#     if os.path.isfile(dataset_cache):
#         logger.info("Load tokenized dataset from cache at %s", dataset_cache)
#         personachat = torch.load(dataset_cache)
#     else:
#         logger.info("Download PERSONACHAT dataset from %s", dataset_path)
#         personachat_file = cached_path(dataset_path)
#         with open(personachat_file, "r", encoding="utf-8") as f:
#             personachat = json.loads(f.read())

#         logger.info("Tokenize and encode the dataset")
#         def tokenize(obj):
#             if isinstance(obj, str):
#                 return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
#             if isinstance(obj, dict):
#                 return dict((n, tokenize(o)) for n, o in obj.items())
#             return list(tokenize(o) for o in obj)
#         personachat = tokenize(personachat)
#         torch.save(personachat, dataset_cache)

#     logger.info("Filter personalities")
#     personalities = []
#     for dataset in personachat.values():
#         for dialog in dataset:
#             personalities.append(dialog["personality"])

#     logger.info("Gathered {} personalities".format(len(personalities)))
#     return personalities

In [0]:
import json
from pytorch_pretrained_bert import cached_path

url = "https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json"
# Download and load JSON dataset
personachat_file = cached_path(url)
with open(personachat_file, "r", encoding="utf-8") as f:
    dataset = json.loads(f.read())

# Tokenize and encode the dataset using our loaded GPT tokenizer
def tokenize(obj):
    if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
    if isinstance(obj, dict):
        return dict((n, tokenize(o)) for n, o in obj.items())
    return list(tokenize(o) for o in obj)
 
dataset = tokenize(dataset)

100%|██████████| 209850483/209850483 [00:03<00:00, 56462596.02B/s]
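
The `tokenize` function above walks the nested JSON recursively: strings are encoded, dictionaries keep their keys, and anything else is treated as a list. Here is a small self-contained sketch of the same pattern with a stand-in encoder, so it can be run without downloading anything. `encode_nested`, `toy_encode` and the miniature `sample` are hypothetical and only loosely modelled on the dataset.

```python
def encode_nested(obj, encode):
    """Apply `encode` to every string in a nested structure of dicts and lists."""
    if isinstance(obj, str):
        return encode(obj)
    if isinstance(obj, dict):
        return {key: encode_nested(value, encode) for key, value in obj.items()}
    return [encode_nested(item, encode) for item in obj]


# Stand-in encoder (assumption): each new word gets the next free integer id.
toy_vocab = {}

def toy_encode(text):
    return [toy_vocab.setdefault(word, len(toy_vocab)) for word in text.split()]


# Hypothetical miniature sample, for illustration only.
sample = {
    "train": [
        {
            "personality": ["i like football ."],
            "utterances": [
                {"history": ["hello"], "candidates": ["hi", "i like football ."]}
            ],
        }
    ]
}

encoded = encode_nested(sample, toy_encode)
print(encoded)
```

The shape of the data is preserved, so downstream code can index the encoded dataset with the same keys as the raw JSON.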


In [0]:
import torch.nn.functional as F

def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
        Args:
            logits: logits distribution shape (vocabulary size)
            top_k >0: keep only top k tokens with highest probability (top-k filtering).
            top_p >0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
    """
    assert logits.dim() == 1  # batch size 1 for now - could be updated for more but the code would be less clear
    top_k = min(top_k, logits.size(-1))  # Safety check
    if top_k > 0:
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value
    return logits

# Here is how to use this function for top-p sampling
temperature = 1.0
top_k = 0
top_p = 0.9

# Get logits with a forward pass in our model.
# Assumption: we reuse the `input_ids`, `mc_token_ids` and `token_type_ids` tensors built above;
# without labels, the double-heads model returns the language modeling and multiple choice logits.
model.eval()
with torch.no_grad():
    lm_logits, mc_logits = model(input_ids, mc_token_ids, token_type_ids=token_type_ids)

# Keep only the predictions at the last token of the first candidate of the first batch item,
# apply a temperature coefficient and filter
logits = lm_logits[0, 0, last_token, :] / temperature
filtered_logits = top_k_top_p_filtering(logits, top_k=top_k, top_p=top_p)

# Sample from the filtered distribution
probabilities = F.softmax(filtered_logits, dim=-1)
next_token = torch.multinomial(probabilities, 1)
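
To see what nucleus (top-p) filtering does on a small example, here is a pure-Python sketch that mirrors the logic of `top_k_top_p_filtering` without tensors. `softmax` and `nucleus_keep_indices` are hypothetical helpers written for this illustration, and the logits are invented.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_keep_indices(probs, top_p):
    """Indices kept by top-p filtering: most probable tokens until the cumulative mass exceeds top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative > top_p:
            break  # the first token that crosses the threshold is still kept
    return kept


toy_logits = [2.0, 1.0, 0.5, -1.0]  # invented values
probs = softmax(toy_logits)
kept = nucleus_keep_indices(probs, top_p=0.9)

# Zero out the filtered tokens and renormalise before sampling.
kept_mass = sum(probs[i] for i in kept)
renormalized = [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]

next_token = random.choices(range(len(probs)), weights=renormalized, k=1)[0]
print(kept, renormalized, next_token)
```

Tokens in the low-probability tail get zero weight, so sampling stays random while avoiding very unlikely continuations.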