Seeing is believing. In this notebook, we print and inspect the outputs that a BERT model returns for a single sentence.

In [1]:
import torch
from transformers import BertTokenizer, BertModel

MODEL_NAME = "bert-base-cased"
# The outputs shown below (sequence length 8) were produced with this sentence.
SENTENCE = "Hello, my dog is cute"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)

# Tokenize the sentence into PyTorch tensors.
inputs = tokenizer(SENTENCE, return_tensors="pt")

# Also request the hidden states of every layer, not only the last one.
outputs = model(**inputs, output_hidden_states=True)


`outputs` is of type `BaseModelOutputWithPoolingAndCrossAttentions`: in simple terms, a tuple-like object that, for this call, contains 3 items.

In [2]:
print(type(outputs))

len(outputs)

<class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'>


3

>**last_hidden_state** (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
>
>**pooler_output** (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
>
>**hidden_states** (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

In [3]:
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
hidden_states = outputs.hidden_states
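
The shapes quoted from the documentation can be checked without downloading any weights. The sketch below is a stand-in, not the notebook's model: it assumes a hypothetical tiny `BertConfig` (the sizes are arbitrary) with randomly initialized weights and arbitrary token ids, so only the shapes are meaningful, not the values.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration (random weights, no download).
# These sizes are illustrative and are NOT those of bert-base-cased.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout

# One sequence of 8 arbitrary token ids: batch_size=1, sequence_length=8.
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    tiny_outputs = tiny_model(input_ids=input_ids, output_hidden_states=True)

# (batch_size, sequence_length, hidden_size)
print(tiny_outputs.last_hidden_state.shape)
# (batch_size, hidden_size)
print(tiny_outputs.pooler_output.shape)
# One tensor for the embeddings plus one per layer
print(len(tiny_outputs.hidden_states))
```

With `bert-base-cased`, the same three fields should follow the same pattern, with `hidden_size` equal to 768.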

# `last_hidden_state`

In [4]:
print(f"last_hidden_state has shape {last_hidden_state.shape}")
print(last_hidden_state.detach().numpy())

last_hidden_state has shape torch.Size([1, 8, 768])
[[[ 0.51323897  0.5097055   0.19912957 ... -0.38999233  0.40526906
   -0.23153386]
  [ 0.5394626  -0.3658087   0.6667345  ... -0.39200187  0.25045085
    0.02019705]
  [ 0.7766632   0.6822611   0.7109605  ... -0.04200423 -0.37177894
    0.37482336]
  ...
  [ 0.35550103  0.44857284  0.61754423 ... -0.03878015 -0.26307565
    0.35140684]
  [ 0.7927245  -0.12816776  0.27373865 ... -0.521956    0.48364452
    0.09373149]
  [ 1.2903227   1.035556    0.50537765 ... -0.43437806  1.1972625
   -0.4235841 ]]]


# `hidden_states`
By definition, the last entry of `hidden_states` is the output of the final layer, so:
```python
hidden_states[-1] == last_hidden_state
```

In [7]:
len(hidden_states)

13

In [8]:
hidden_states[0].shape

torch.Size([1, 8, 768])
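
The length of 13 matches the documentation quoted earlier: one tensor for the output of the embeddings plus one for each of the 12 layers of `bert-base-cased`. The relationship between `hidden_states` and `last_hidden_state` can also be checked numerically. This is a minimal sketch that assumes a hypothetical tiny, randomly initialized model (2 layers instead of 12) so that it runs without a download; the same comparison should apply to the `outputs` object above.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration with random weights (illustrative sizes only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config).eval()
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    out = tiny_model(input_ids=input_ids, output_hidden_states=True)

# One entry for the embeddings output plus one per encoder layer.
n_states = len(out.hidden_states)
print(n_states == config.num_hidden_layers + 1)

# The last entry holds the same values as last_hidden_state.
same_as_last = torch.equal(out.hidden_states[-1], out.last_hidden_state)
print(same_as_last)
```

Note that `==` on two tensors is elementwise and returns a tensor of booleans, which is why the sketch uses `torch.equal` to obtain a single `True` or `False`.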

# `pooler_output`

`pooler_output` is **NOT** the same as `last_hidden_state[:, 0, :]`.

>Last layer hidden-state of the first token of the sequence (classification token) after **further processing through the layers used for the auxiliary pretraining task**. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the **next sentence prediction** (classification) objective during pretraining.

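In other words, `pooler_output` is obtained by passing `last_hidden_state[:, 0, :]` (the vector of the `[CLS]` token) through a linear layer followed by a tanh activation; the weights of that linear layer were trained on the next sentence prediction objective. The `grad_fn` of each tensor reflects this: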

In [99]:
pooler_output.grad_fn

<TanhBackward0 at 0x7df665f02940>

In [101]:
last_hidden_state.grad_fn

<NativeLayerNormBackward0 at 0x7df66625feb0>
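
The `TanhBackward0` node is consistent with a tanh being the last operation applied to produce `pooler_output`, while `last_hidden_state` ends in a layer normalization. As a hedged illustration, the pooling step can be reproduced by hand. The sketch assumes a hypothetical tiny, randomly initialized `BertModel` (so no download is needed) and relies on the pooler exposing its linear layer as `pooler.dense`, as the `BertModel` implementation in `transformers` does.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration with random weights (illustrative sizes only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config).eval()
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    out = tiny_model(input_ids=input_ids)

    # Hidden state of the first token ([CLS] position) in the last layer.
    cls_vector = out.last_hidden_state[:, 0, :]

    # Reproduce the pooler by hand: linear layer, then tanh.
    manual_pooled = torch.tanh(tiny_model.pooler.dense(cls_vector))

# The manual computation should agree with pooler_output ...
matches_pooler = torch.allclose(manual_pooled, out.pooler_output, atol=1e-6)
# ... while the raw [CLS] vector should not.
matches_cls = torch.allclose(cls_vector, out.pooler_output)
print(matches_pooler, matches_cls)
```

Applied to the notebook's own `model` and `outputs`, the same two comparisons should show that `pooler_output` is a transformed version of `last_hidden_state[:, 0, :]` rather than a copy of it.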