# **Google BERT Transformers with PyTorch Example**

**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**


PyTorch Transformers can be installed with pip (shown in the first code cell below). A few prerequisites:
*   Change the runtime type to GPU
*   Use Python 3.5+ and PyTorch 1.1.0 or later
*   Import the required libraries






## What is BERT?
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

There are a few things I want to explain in this section.

1. BERT stands for Bidirectional Encoder Representations from Transformers. Each word here has a meaning, and we will look at them one by one. For now, the key takeaway from this line is that BERT is based on the Transformer architecture.

2. BERT is pre-trained on a large corpus of unlabelled text, including the entire Wikipedia (that’s 2,500 million words!) and Book Corpus (800 million words). This pre-training step is really important for BERT's success: as we train a model on a large text corpus, it starts to pick up a deeper understanding of how the language works. This knowledge is the Swiss Army knife that is useful for almost any NLP task.

3. BERT is a deeply bidirectional model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

BERT was built upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit — but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right). 



For example, in the sentence "I made a bank deposit", the unidirectional representation of "bank" is based only on "I made a" but not on "deposit". Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner.

**Example:** BERT represents "bank" using both its left and right context (**I made a ... deposit**), starting from the very bottom of a deep neural network, so it is deeply bidirectional.

![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/sent_context.png)
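The contrast between unidirectional and bidirectional context can be sketched in plain Python. This is only an illustration of which words are visible when representing "bank"; the helper name `visible_context` is an assumption made for this example and is not part of any library.

```python
def visible_context(tokens, index, mode):
    """Return the tokens a model may use to represent tokens[index].

    mode is "left" (left-to-right model), "right" (right-to-left model)
    or "both" (a bidirectional model such as BERT).
    """
    left = tokens[:index]
    right = tokens[index + 1:]
    if mode == "left":
        return left
    if mode == "right":
        return right
    if mode == "both":
        return left + right
    raise ValueError(f"unknown mode: {mode!r}")


sentence = ["i", "made", "a", "bank", "deposit"]
bank = sentence.index("bank")

print(visible_context(sentence, bank, "left"))  # ['i', 'made', 'a']
print(visible_context(sentence, bank, "both"))  # ['i', 'made', 'a', 'deposit']
```

Only the bidirectional view includes "deposit", which is the word that disambiguates "bank" here.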

## Main concepts

The library is built around three types of classes for each model:

1. **model classes**, which are PyTorch models (torch.nn.Module) of the 8 model architectures currently provided in the library, e.g. BertModel
2. **configuration classes**, which store all the parameters required to build a model, e.g. BertConfig. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model automatically takes care of instantiating the configuration (which is part of the model)
3. **tokenizer classes**, which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. BertTokenizer

All these classes can be instantiated from pretrained instances and saved locally using two methods:

**from_pretrained()** lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided) or stored locally (or on a server) by the user.

**save_pretrained()** lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().

### 1. Bert Model Architecture

BERT’s model architecture is a multi-layer bidirectional Transformer encoder. The available pre-trained models include:

1. **BERT-Large, Uncased (Whole Word Masking)**: 24-layer, 1024-hidden, 16-heads, 340M parameters
2. **BERT-Large, Cased (Whole Word Masking)**: 24-layer, 1024-hidden, 16-heads, 340M parameters
3. **BERT-Base, Uncased**: 12-layer, 768-hidden, 12-heads, 110M parameters
4. **BERT-Large, Uncased**: 24-layer, 1024-hidden, 16-heads, 340M parameters
5. **BERT-Base, Cased**: 12-layer, 768-hidden, 12-heads, 110M parameters
6. **BERT-Large, Cased**: 24-layer, 1024-hidden, 16-heads, 340M parameters

![](http://jalammar.github.io/images/bert-base-bert-large-encoders.png)

We denote the number of layers (i.e., Transformer blocks) as **L**, the **hidden size** as **H**, and the number of self-attention heads as **A**. We primarily report results on two model sizes:

1. **BERT-Base (L=12, H=768, A=12, Total Parameters=110M)**
2. **BERT-Large (L=24, H=1024, A=16, Total Parameters=340M)**
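The quoted totals can be roughly reproduced from L and H. The sketch below is an approximation under stated assumptions: a vocabulary of 30522 tokens, 512 positions, 2 segment types, a feed-forward size of 4H, and a pooling layer on top of the encoder. The function name `bert_parameter_count` is hypothetical. The number of heads A does not appear because the heads split the same H-sized projections.

```python
def bert_parameter_count(num_layers, hidden, vocab_size=30522,
                         max_positions=512, type_vocab_size=2):
    """Approximate parameter count of a BERT encoder without a task head."""
    intermediate = 4 * hidden  # feed-forward size, e.g. 3072 for H=768

    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden
    embeddings += 2 * hidden

    # Query, key, value and output projections (weights and biases)
    # followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden

    # Two dense layers (H -> 4H -> H) followed by a LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden
                    + 2 * hidden)

    # Dense layer applied to the [CLS] hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_parameter_count(num_layers=12, hidden=768)
large = bert_parameter_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at about 109.5M and 335.1M parameters, in line with the rounded 110M and 340M figures above.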

**Load pre-trained model tokenizer (vocabulary)**

The base class PreTrainedTokenizer implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository).

```
class transformers.BertModel
```
*Parameters*

*   config (BertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

**Example**

The model takes as input the indices of the input sequence tokens in the vocabulary. To match pre-training, a BERT input sequence should be formatted with [CLS] and [SEP] tokens as follows:

1. **For sequence pairs:**

tokens: `[CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]`

token_type_ids: `0 0 0 0 0 0 0 0 1 1 1 1 1 1`

2. **For single sequences:**

tokens: `[CLS] the dog is hairy . [SEP]`

token_type_ids: `0 0 0 0 0 0 0`
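The two layouts above can be produced mechanically. The following is a minimal sketch in plain Python, assuming the input is already split into wordpiece tokens; `pack_inputs` is a hypothetical helper written for this example, not a library function.

```python
def pack_inputs(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and build token_type_ids for one or two sequences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first sequence, including [CLS] and its [SEP]
    if tokens_b is not None:
        second = list(tokens_b) + ["[SEP]"]
        tokens += second
        token_type_ids += [1] * len(second)  # second sequence and its [SEP]
    return tokens, token_type_ids


pair_tokens, pair_types = pack_inputs(
    "is this jack ##son ##ville ?".split(),
    "no it is not .".split(),
)
single_tokens, single_types = pack_inputs("the dog is hairy .".split())

print(pair_tokens)
print(pair_types)
print(single_types)
```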





### 2. Bert Tokenizer
**BertTokenizer** is one of the tokenizer classes, which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. DistilBertTokenizer, BertTokenizer, etc.

![](https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-1.png)



**PreTrainedTokenizer** is the main entry point into tokenizers as it also implements the main methods for using all the tokenizers:

*   tokenizing, converting tokens to ids and back and encoding/decoding,
*   adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…),
*   managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization)

**Main class from which the tokenizer is called**
```
class transformers.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, **kwargs)

```

*Parameters*


*   **vocab_file** – Path to a one-wordpiece-per-line vocabulary file
*   **do_lower_case** – Whether to lower case the input. Only has an effect when do_basic_tokenize=True
*   **do_basic_tokenize** – Whether to do basic tokenization before wordpiece.
*   **max_len** – An artificial maximum length to truncate tokenized sequences to. The effective maximum length is always the minimum of this value (if specified) and the underlying BERT model’s sequence length.
*   **never_split** – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True
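The wordpiece step mentioned in the parameters splits a word into sub-word units from the vocabulary. Below is a minimal sketch of greedy longest-match-first splitting, assuming a tiny made-up vocabulary (`toy_vocab`). The helper `wordpiece_split` is illustrative only; the real tokenizer handles more cases, such as very long words.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into wordpieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"jim", "henson", "puppet", "##eer", "was", "a"}
print(wordpiece_split("puppeteer", toy_vocab))  # ['puppet', '##eer']
print(wordpiece_split("zebra", toy_vocab))      # ['[UNK]']
```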


![](https://www.researchgate.net/publication/323904682/figure/fig1/AS:606458626465792@1521602412057/The-Transformer-model-architecture.png)

For a detailed look at how the encoder-decoder architecture works, see the original Transformer paper ([Link1](https://arxiv.org/pdf/1706.03762.pdf)) and a visual walkthrough ([Link2](http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)).

**ARCHITECTURE** (of the original Transformer, as described in the paper):
1. **Encoder**: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

2. **Decoder:** The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
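Two ideas from the description above can be sketched in a few lines of plain Python: the LayerNorm(x + Sublayer(x)) wrapper and the decoder mask that blocks attention to later positions. This is a simplified illustration for a single position vector; the learned scale and shift of layer normalization are left out, and the function names are assumptions made for this example.

```python
import math


def layer_norm(vector, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


# A stand-in sub-layer that simply doubles its input.
output = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [2 * a for a in v])
print(output)

for row in causal_mask(3):
    print(row)
```

BERT uses only the encoder stack, so it applies no such mask: every position can attend to every other position.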

Let’s try to classify the sentence “a visually stunning rumination on love”. The first step is to use the BERT tokenizer to split the sentence into tokens. Then, we add the special tokens needed for sentence classification (these are [CLS] at the first position, and [SEP] at the end of the sentence).

[CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/bert_emnedding.png)

BERT input representation. The **input embeddings** are the sum of the **token embeddings**, the **segmentation embeddings** and **the position embeddings**.

1. The first token of every sequence is always a special classification token (**[CLS]**).
2. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence.
3. We differentiate the sentences in two ways. First, we separate them with a special token (**[SEP]**). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
4. A positional embedding is also added to each token to indicate its position in the sequence.
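The sum of the three embeddings can be illustrated with small lookup tables. The sketch below uses made-up 2-dimensional integer vectors so the arithmetic is easy to follow; the tables and the function `input_embeddings` are hypothetical. In the actual model the summed vectors are additionally passed through layer normalization and dropout.

```python
def input_embeddings(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment and position embeddings for every input position."""
    embeddings = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        token_vec = token_table[token_id]
        segment_vec = segment_table[segment_id]
        position_vec = position_table[position]
        embeddings.append(
            [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]
        )
    return embeddings


# Toy tables (assumed values, chosen only for readability).
token_table = {101: [1, 0], 2040: [0, 2], 102: [3, 3]}
segment_table = [[10, 10], [20, 20]]
position_table = [[100, 0], [0, 100], [100, 100]]

summed = input_embeddings([101, 2040, 102], [0, 0, 0],
                          token_table, segment_table, position_table)
print(summed)  # [[111, 10], [10, 112], [113, 113]]
```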


### 3. Configuration Class


The base class PretrainedConfig implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace’s AWS S3 repository).



```
class transformers.PretrainedConfig(**kwargs)
```
Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations.

*Parameters*

*   **finetuning_task** – string, default None. Name of the task used to fine-tune the model. This can be used when converting from an original (TensorFlow or PyTorch) checkpoint.
*   **num_labels** – integer, default 2. Number of classes to use when the model is a classification model (sequences/tokens)
*   **output_hidden_states** – boolean, default False. Whether the model should return all hidden states.
*   **output_attentions** – boolean, default False. Whether the model should return attention weights.
*   **torchscript** – boolean, default False. Whether the model is used with TorchScript.





# Model Building

Using BERT has two stages: Pre-training and fine-tuning.

**Pre-training**: It is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only, but multilingual models will be released in the near future). We are releasing a number of pre-trained models from the paper which were pre-trained at Google. Most NLP researchers will never need to pre-train their own model from scratch.

**Fine-tuning**: It is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.


![](https://insidebigdata.com/wp-content/uploads/2019/10/Peltarion_pic2.jpg)

BERT is pre-trained on two tasks, referred to here as two models:
> *Model 1*: **Masked LM**

> *Model 2*: **Next Sentence Prediction (NSP)**

## Model 1: Masked LM

**STEPS**

1. Intuitively, a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. However, bidirectional conditioning would allow each word to indirectly “see itself”, so the model could trivially predict the target word in a multi-layered context.

2. In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.

3. In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. 
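Step 3 can be sketched with toy numbers: the hidden vector at a masked position is projected onto the vocabulary, passed through a softmax, and scored with cross-entropy against the original word. The vocabulary, weights and hidden vector below are invented for illustration, and the real output layer also includes a bias and an extra transform, which are omitted here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_lm_loss(hidden, output_weights, target_index):
    """Cross-entropy loss for one masked position."""
    # One logit per vocabulary entry: dot product of hidden vector and weight row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    probabilities = softmax(logits)
    return -math.log(probabilities[target_index])


toy_vocab = ["store", "gallon", "running"]
hidden_at_mask = [0.5, -1.0]
output_weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]  # one row per vocabulary word

loss_if_store = masked_lm_loss(hidden_at_mask, output_weights, toy_vocab.index("store"))
loss_if_gallon = masked_lm_loss(hidden_at_mask, output_weights, toy_vocab.index("gallon"))
print(loss_if_store, loss_if_gallon)
```

With these toy values "store" gets the highest logit, so the loss is lower when "store" is the original word than when "gallon" is.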


Although this allows us to obtain a bidirectional pretrained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the **[MASK]** token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual **[MASK]** token. 

**Problem 1:** Language models only use left context or right context, but language understanding is bidirectional.

Why are LMs unidirectional?

*   *Reason 1:* Directionality is needed to generate a well-formed probability distribution. We don’t care about this.
*   *Reason 2:* Words can “see themselves” in a bidirectional encoder.

**Solution:** Mask out k% of the input words, and then predict the masked words. We always use k = 15%.


                        store           gallon
                          |               |
    the man went to the [MASK] to buy a [MASK] of milk

*   Too little masking: Too expensive to train
*   Too much masking: Not enough context



**Problem 2:** The mask token is never seen at fine-tuning.

**Solution:** Select 15% of the words to predict, but don’t replace them with **[MASK]** 100% of the time. Instead:

*   80% of the time, replace with **[MASK]**: went to the store → went to the **[MASK]**
*   10% of the time, replace with a random word: went to the store → went to the running
*   10% of the time, keep the same word: went to the store → went to the store


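If the i-th token is chosen, we replace the i-th token with (1) the **[MASK]** token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Then, T_i (the final hidden vector for the i-th token) will be used to predict the original token with cross entropy loss.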
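The 80/10/10 rule can be sketched as follows. This is a simplified illustration, not the original pre-training code: it decides independently for each token whether to select it (with probability `mask_prob`), while the function name `mask_tokens`, the toy vocabulary and the use of `None` for unselected labels are assumptions made for this example.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels) following the 80/10/10 rule."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means: nothing to predict at this position
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = "[CLS] the man went to the store [SEP]".split()
toy_vocab = ["running", "milk", "dog"]
masked, labels = mask_tokens(tokens, toy_vocab, random.Random(0))
print(masked)
print(labels)
```

Passing a seeded `random.Random` keeps the corruption reproducible between runs.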



```python
# Notebook shell command: install the library
!pip install transformers

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM
```

The install log for this run reports transformers 2.3.0 (together with sentencepiece 0.1.85 and sacremoses 0.0.38) and PyTorch 1.3.1.


```python
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
```


```python
# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
tokenized_text
```

```
['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']
```

Next, we replace one token with [MASK]; this is the word the model will try to predict.


**TOKEN EMBEDDINGS**

```python
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
```


```python
tokenized_text
```

```
['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
```

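**POSITION EMBEDDINGS**

Positions need no separate input in this example: each token's position is simply its index in the sequence. The cell below converts the tokens to vocabulary indices, which the model looks up in its token embedding table.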

```python
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
indexed_tokens
```

```
[101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 103, 2001, 1037, 13997, 11510, 102]
```

**SEGMENT EMBEDDINGS**

Each token is labelled with the sentence it belongs to: 0 for tokens in the first sentence and 1 for tokens in the second.

```python
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```
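Rather than typing the segment ids by hand, they can be derived from the token list by switching from 0 to 1 after the first [SEP]. This is a small sketch in plain Python; `segment_ids_from_tokens` is a hypothetical helper, and it assumes at most two sentences per input.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign 0 up to and including the first [SEP], then 1 for the rest."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            current = 1  # everything after the first separator is sentence B
    return segment_ids


example_tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
                  'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
print(segment_ids_from_tokens(example_tokens))
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```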



```python
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
```

```python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
```


**Model configuration (selected values)**

*   layer_norm_eps: 1e-12
*   max_position_embeddings: 512
*   num_attention_heads: 12
*   num_hidden_layers: 12
*   num_labels: 2
*   torchscript: false
*   type_vocab_size: 2
*   vocab_size: 30522

```python
# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()
```

Output (truncated):

```
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
```

```python
# If you have a GPU, put everything on cuda; otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)
```



```python
# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]
```


```python
encoded_layers.shape
```

```
torch.Size([1, 14, 768])
```

The output is a 3D tensor of shape **(batch size, tokens, hidden units)**:

>  **1** is the batch size: one input sequence (here, a sentence pair)

>  **14** is the total number of tokens after tokenization

>  **768** is the number of hidden units in the **BERT-Base, Uncased** model.



```python
print(len(indexed_tokens), model.config.hidden_size)
```

```
14 768
```


```python
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
```

```python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
```

Output (truncated):

```
INFO:transformers.configuration_utils:Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_siz

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr
```

```python
# If you have a GPU, put everything on cuda; otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)
```



```python
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]
```


```python
predictions
```

```
tensor([[[ -7.8798,  -7.7874,  -7.7861,  ...,  -7.0438,  -6.7454,  -4.6013],
         [-13.3633, -13.7694, -13.7819,  ..., -11.8128, -11.1635, -13.8906],
         [-10.9775, -10.5383, -10.9659,  ..., -11.5549,  -8.0309,  -6.3979],
         ...,
         [ -5.2284,  -5.6572,  -5.3550,  ...,  -3.4507,  -3.8718,  -8.6904],
         [ -8.5290,  -8.4146,  -9.0744,  ...,  -7.1710,  -6.9877,  -6.1301],
         [-12.5968, -12.3769, -12.4222,  ..., -10.1020,  -9.8764,  -9.4495]]],
       device='cuda:0')
```

```python
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()

predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_index, predicted_token)
```

```
27227 henson
```


```python
assert predicted_token == 'henson'
```
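The argmax above returns only the single best candidate. To see which other words the model considered, the scores at the masked position can be ranked. The sketch below shows the idea in plain Python on an invented four-word vocabulary with invented scores; `top_k_tokens` is a hypothetical helper, and with the real model the scores would come from `predictions[0, masked_index]`.

```python
def top_k_tokens(logits, id_to_token, k=3):
    """Return the k highest-scoring tokens with their scores, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id_to_token[i], logits[i]) for i in ranked[:k]]


# Toy values, assumed for illustration only.
toy_id_to_token = ["jim", "henson", "was", "puppet"]
toy_logits = [1.5, 9.2, -0.3, 4.0]

print(top_k_tokens(toy_logits, toy_id_to_token, k=2))
# [('henson', 9.2), ('puppet', 4.0)]
```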