### Test: BERT large uncased

Model reference and usage description: https://huggingface.co/google-bert/bert-large-uncased

Example usage: feeding an input text to the model.

In [3]:
from transformers import BertTokenizer, BertModel

# Load the pretrained tokenizer and encoder weights
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")

# Tokenize the text into PyTorch tensors and run a forward pass
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

BERT maps each input token to a vector in its embedding space.  
The `last_hidden_state` contains the final-layer contextual embedding of every token in the sequence.

The pooler takes the final hidden state of the `[CLS]` token and passes it through a fully connected layer with a tanh activation, producing one condensed vector per input sequence.  
This vector can serve as input to a classification head, since it stands in for the whole input sequence.
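The pooling step described above can be sketched without any dependencies. This is a toy illustration, not the library's code: the function name `pool_cls` and all numbers are made up, and the real layer works on 1024-dimensional vectors with learned weights. It assumes the pooler computes `tanh(W · h_cls + b)`, which matches the `TanhBackward0` seen in the notebook output.

```python
import math

def pool_cls(last_hidden_state, weight, bias):
    """Toy pooler for ONE sequence: tanh(W @ h_cls + b).

    last_hidden_state: list of per-token vectors, [CLS] first.
    weight, bias: hypothetical dense-layer parameters.
    """
    h_cls = last_hidden_state[0]  # hidden state of the first token ([CLS])
    pooled = []
    for row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(row, h_cls)) + b  # dense layer
        pooled.append(math.tanh(z))                      # squash into (-1, 1)
    return pooled

# Made-up example: 3 tokens, hidden_size = 2
hidden = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.5]

pooled = pool_cls(hidden, W, b)
print(pooled)  # one vector of length hidden_size, every entry in (-1, 1)
```

Only the first token's vector enters the computation; the other tokens influence the result indirectly, through the attention layers that produced the `[CLS]` hidden state.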

In [6]:
last_hidden_state = output.last_hidden_state
pooler_output = output.pooler_output

# Print the shapes of the outputs
print("Last Hidden State Shape:", last_hidden_state.shape) # Shape: [batch_size, sequence_length, hidden_size]
print("Pooler Output Shape:", pooler_output.shape) # Shape: [batch_size, hidden_size]

# Decode the tokens back to see how BERT splits the input
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
print(f"Tokens: {tokens}")
print(f"Embedding of first token (CLS): {last_hidden_state[0][0]}")  # CLS token embedding
print(f"Embedding of the first word token: {last_hidden_state[0][1]}")  # First word's token embedding

# Embedding of the CLS token after pooling
print(f"Pooler Output (CLS after pooling): {pooler_output}")

Last Hidden State Shape: torch.Size([1, 12, 1024])
Pooler Output Shape: torch.Size([1, 1024])
Tokens: ['[CLS]', 'replace', 'me', 'by', 'any', 'text', 'you', "'", 'd', 'like', '.', '[SEP]']
Embedding of first token (CLS): tensor([-0.1534, -0.9412, -0.6168,  ..., -0.7690, -0.0030,  0.2449],
       grad_fn=<SelectBackward0>)
Embedding of the first word token: tensor([-0.5923, -0.7163, -0.9268,  ...,  0.4954,  0.4566,  0.0285],
       grad_fn=<SelectBackward0>)
Pooler Output (CLS after pooling): tensor([[-0.9995, -0.9970,  1.0000,  ..., -1.0000,  0.9944, -0.9978]],
       grad_fn=<TanhBackward0>)


Explanation:  
`batch_size`: number of input sequences processed together. If several sentences are passed in one call, the batch size is greater than 1.  
`sequence_length`: number of tokens in the input, including the special `[CLS]` and `[SEP]` tokens (12 in the example above). BERT accepts at most 512 tokens per sequence.  
`hidden_size`: dimensionality of each token's hidden-state vector, fixed by the model architecture. `bert-base-uncased` uses 768, `bert-large-uncased` uses 1024.
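To make `batch_size` and `sequence_length` concrete, here is a small dependency-free sketch of how sequences of different lengths end up in one rectangular batch. The helper `pad_batch` and the token ids are hypothetical, and the truncation is deliberately naive (it simply cuts at 512 tokens and does not re-append `[SEP]`). With the real tokenizer, passing `padding=True, truncation=True, max_length=512` should achieve the equivalent.

```python
MAX_LEN = 512  # maximum number of tokens BERT accepts per sequence

def pad_batch(token_id_lists, pad_id=0, max_len=MAX_LEN):
    """Truncate to max_len, then pad every sequence to the longest one."""
    truncated = [ids[:max_len] for ids in token_id_lists]
    seq_len = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (seq_len - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (seq_len - len(ids))
                      for ids in truncated]
    return input_ids, attention_mask

# Two made-up sequences of 3 and 5 token ids
batch = [[101, 11, 102], [101, 11, 12, 13, 102]]
input_ids, attention_mask = pad_batch(batch)

print(len(input_ids), len(input_ids[0]))  # batch_size, sequence_length -> 2 5
print(attention_mask[0])                   # [1, 1, 1, 0, 0]
```

For such a batch, `last_hidden_state` would have shape `[2, 5, hidden_size]` and `pooler_output` shape `[2, hidden_size]`.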