# A Simple BERT Example

### Import PyTorch and Hugging Face libraries

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import logging

logging.set_verbosity_error()

### Load Tokenizer and Model

Here, we're using __DistilBERT__, a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [2]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

### Tokenize a sample input string

In [3]:
inputs = tokenizer("As a chef, I [MASK] to cook.", return_tensors="pt")

inputs.keys()

dict_keys(['input_ids', 'attention_mask'])

In [4]:
inputs['input_ids']

tensor([[  101,  2004,  1037, 10026,  1010,  1045,   103,  2000,  5660,  1012,
           102]])

Note how the special CLS token id (101) is __prepended__ to the input and the SEP token id (102) is __appended__ to it, while the '[MASK]' placeholder in the input string is encoded as the MASK token id (103).
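To make the alignment between tokens and ids explicit, here is a minimal sketch that reproduces the id sequence above with a toy lookup table. The `toy_vocab` dictionary and the `toy_encode` helper are hypothetical and exist only for illustration. The token-to-id pairs are read off by position from the output above, assuming each word and punctuation mark maps to exactly one token. The real tokenizer uses a much larger WordPiece vocabulary and also handles lowercasing and splitting.

```python
# Toy vocabulary containing only the ids that appear in the output above.
# Assumption: each word or punctuation mark corresponds to exactly one id.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "as": 2004, "a": 1037, "chef": 10026, ",": 1010,
    "i": 1045, "to": 2000, "cook": 5660, ".": 1012,
}


def toy_encode(tokens):
    """Wrap pre-split, lowercased tokens in [CLS] ... [SEP] and map them to ids."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [toy_vocab[token] for token in wrapped]


# The sample sentence, already lowercased and split by hand.
tokens = ["as", "a", "chef", ",", "i", "[MASK]", "to", "cook", "."]
ids = toy_encode(tokens)
print(ids)
# [101, 2004, 1037, 10026, 1010, 1045, 103, 2000, 5660, 1012, 102]
```

Nine input tokens plus the two special tokens give the eleven ids shown in the tensor above.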

### We then send the tokenized inputs through the model

In [5]:
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

### Now, we can obtain the hidden states

In [6]:
last_hidden_states = outputs.hidden_states[-1]

last_hidden_states.size()

torch.Size([1, 11, 768])

The dimensions of the last hidden states are:
 - 1 = the number of inputs we sent through the model (i.e., the batch size)
 - 11 = the number of token ids in the tokenized input sequence
 - 768 = the hidden dimension from the model
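As a rough illustration of how these three axes are typically indexed, the sketch below uses a random tensor as a stand-in for `last_hidden_states` (the name `hidden` and its values are placeholders, not model output). It pulls out the vector for a single token and, as one common but by no means the only pooling choice, averages over the token axis to get one vector per input.

```python
import torch

torch.manual_seed(0)

# Stand-in with the same shape as last_hidden_states: (batch, tokens, hidden).
hidden = torch.randn(1, 11, 768)

batch_size, seq_len, hidden_dim = hidden.shape

# Vector for one token: first (and only) input, position 6.
token_vector = hidden[0, 6]

# Mean over the token axis gives a single vector per input sequence.
sentence_vector = hidden.mean(dim=1)

print(token_vector.shape)     # torch.Size([768])
print(sentence_vector.shape)  # torch.Size([1, 768])
```

Position 6 is chosen because that is where the MASK token id (103) sits in the tokenized input.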

In [7]:
last_hidden_states

tensor([[[-6.9379e-02,  1.4208e-01,  8.8803e-02,  ..., -2.6940e-04,
           7.9767e-02,  1.6876e-01],
         [ 1.7806e-01,  6.9932e-01,  2.2402e-01,  ..., -3.2417e-01,
           2.0508e-01, -2.6537e-01],
         [-2.0196e-02,  4.7216e-01, -1.5619e-01,  ..., -2.1885e-01,
           1.8057e-01, -5.4480e-02],
         ...,
         [ 7.4230e-01,  2.5037e-01, -2.8278e-03,  ..., -2.4099e-01,
           3.7877e-02, -3.2270e-01],
         [ 5.3188e-01,  1.7573e-01, -3.9323e-01,  ...,  2.8174e-01,
          -4.5808e-01, -3.8728e-01],
         [-3.3475e-01,  2.5727e-01,  2.1890e-01,  ...,  4.3778e-02,
           5.7937e-01, -1.4426e-01]]])

### Predicting the MASK token

In [8]:
# retrieve index of [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

mask_token_index

tensor([6])
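The one-line expression above packs several steps together. The following sketch unpacks it on a hard-coded copy of the ids, so it runs without loading a tokenizer. The literal `103` stands in for `tokenizer.mask_token_id`, based on the ids printed earlier.

```python
import torch

# Hard-coded copy of inputs['input_ids'] from the earlier cell.
input_ids = torch.tensor(
    [[101, 2004, 1037, 10026, 1010, 1045, 103, 2000, 5660, 1012, 102]]
)
mask_token_id = 103  # stand-in for tokenizer.mask_token_id

# Step 1: element-wise comparison gives a boolean tensor of shape (1, 11).
is_mask = input_ids == mask_token_id

# Step 2: select the first (and only) sequence in the batch, shape (11,).
row = is_mask[0]

# Step 3: nonzero(as_tuple=True) returns one index tensor per dimension;
# the row is 1-D, so the tuple holds a single tensor of matching positions.
positions = row.nonzero(as_tuple=True)[0]

print(positions)  # tensor([6])
```

If the input contained more than one `[MASK]`, `positions` would hold one entry per occurrence.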

In [9]:
# Use the logits to get the predicted ID

logits = outputs.logits

predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)

predicted_token_id

tensor([4342])

In [10]:
tokenizer.decode(predicted_token_id)

'learned'
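`argmax` returns only the single highest-scoring token. To see how the runner-up candidates compare, one option is to apply a softmax and take the top few entries. The sketch below does this on a made-up five-word vocabulary with invented scores; `toy_vocab` and `toy_logits` are hypothetical and are not output from the model.

```python
import torch

# Hypothetical vocabulary and invented scores, for illustration only.
toy_vocab = ["love", "learned", "hate", "like", "want"]
toy_logits = torch.tensor([2.0, 3.5, -1.0, 1.0, 0.5])

# Softmax turns raw scores into probabilities that sum to 1.
probs = torch.softmax(toy_logits, dim=-1)

# topk returns the k largest values and their indices, in descending order.
top = torch.topk(probs, k=3)

for prob, index in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{toy_vocab[index]}: {prob:.3f}")
```

In the notebook itself, the same two calls could presumably be applied to `logits[0, mask_token_index]` in place of `toy_logits`, with `tokenizer.decode` turning each returned index back into text.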