### Test: BERT large uncased

Model reference and usage description: https://huggingface.co/google-bert/bert-large-uncased

Example usage: Feeding an input text.

In [3]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained("bert-large-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

BERT maps each input token to a vector (embedding).  
The `last_hidden_state` contains the final-layer embedding for each token in the sequence.

The pooler applies a fully connected layer with a `tanh` activation to the final hidden state of the `[CLS]` token to obtain a condensed representation of the entire sequence.  
This single vector can then serve as the input to a classification task, since it stands for the whole input sequence.
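
As a rough illustration of the pooling step described above, the following toy sketch applies a dense layer followed by `tanh` to the first token's vector. The weights are random and the hidden size is 4 rather than 1024; the names `toy_pooler`, `weight`, and `bias` are made up for this example and are not part of the `transformers` API.

```python
import math
import random

def toy_pooler(hidden_states, weight, bias):
    """Dense layer + tanh applied to the first ([CLS]) token vector only."""
    cls_vector = hidden_states[0]  # position 0 holds the [CLS] token
    pooled = []
    for row, b in zip(weight, bias):
        activation = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(activation))
    return pooled

random.seed(0)
hidden_size = 4       # toy value; bert-large-uncased uses 1024
sequence_length = 3   # toy value

# Random stand-ins for last_hidden_state[0] and the pooler parameters
hidden_states = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                 for _ in range(sequence_length)]
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)]
          for _ in range(hidden_size)]
bias = [0.0] * hidden_size

pooled = toy_pooler(hidden_states, weight, bias)
print(len(pooled))                          # one vector of length hidden_size
print(all(-1.0 < v < 1.0 for v in pooled))  # tanh keeps every value in (-1, 1)
```

The `tanh` at the end is why every entry of a pooled vector lies between -1 and 1, regardless of how many tokens the input has.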

In [6]:
last_hidden_state = output.last_hidden_state
pooler_output = output.pooler_output

# Print the shapes of the outputs
print("Last Hidden State Shape:", last_hidden_state.shape) # Shape: [batch_size, sequence_length, hidden_size]
print("Pooler Output Shape:", pooler_output.shape) # Shape: [batch_size, hidden_size]

# Decode the tokens back to see how BERT splits the input
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
print(f"Tokens: {tokens}")
print(f"Embedding of first token (CLS): {last_hidden_state[0][0]}")  # CLS token embedding
print(f"Embedding of the first word token: {last_hidden_state[0][1]}")  # First word's token embedding

# Embedding of the CLS token after pooling
print(f"Pooler Output (CLS after pooling): {pooler_output}")

Last Hidden State Shape: torch.Size([1, 12, 1024])
Pooler Output Shape: torch.Size([1, 1024])
Tokens: ['[CLS]', 'replace', 'me', 'by', 'any', 'text', 'you', "'", 'd', 'like', '.', '[SEP]']
Embedding of first token (CLS): tensor([-0.1534, -0.9412, -0.6168,  ..., -0.7690, -0.0030,  0.2449],
       grad_fn=<SelectBackward0>)
Embedding of the first word token: tensor([-0.5923, -0.7163, -0.9268,  ...,  0.4954,  0.4566,  0.0285],
       grad_fn=<SelectBackward0>)
Pooler Output (CLS after pooling): tensor([[-0.9995, -0.9970,  1.0000,  ..., -1.0000,  0.9944, -0.9978]],
       grad_fn=<TanhBackward0>)


Explanation:  
`batch_size`: number of inputs processed together. If multiple sentences are processed in parallel, the batch size is greater than 1.  
`sequence_length`: number of tokens in the input, including `[CLS]` and `[SEP]` (12 in the example above). BERT accepts at most 512 tokens per input.  
`hidden_size`: dimension of the vector produced for each token in every layer. `bert-base-uncased` uses 768, `bert-large-uncased` uses 1024.
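
To make the first two dimensions concrete, here is a toy sketch in plain Python of how inputs of different lengths become a rectangular `[batch_size, sequence_length]` array: longer inputs are cut to the maximum length and shorter ones are padded. The helper `pad_and_truncate`, the token ids, and the tiny `max_length` are illustrative assumptions, and special tokens are ignored for simplicity. In practice the Hugging Face tokenizer takes care of this when called with `padding=True, truncation=True, max_length=512`.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Return (input_ids, attention_mask) as equally long rows."""
    truncated = [ids[:max_length] for ids in batch]
    sequence_length = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = sequence_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up token ids for three inputs of different lengths
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

print(len(input_ids), len(input_ids[0]))  # batch_size, sequence_length
print(input_ids)
print(attention_mask)
```

Here the batch size is 3 and the sequence length is 5: the second input loses its last two tokens, and the other two are padded.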

### Test: Sentence-BERT (SBERT)
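Sentence-BERT models are fine-tuned to produce one embedding per sentence or text, so that texts can be compared for similarity or matched into related pairs. The model used below, `all-mpnet-base-v2`, is built on MPNet rather than BERT, but is used in the same way.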

Requirement: `pip3 install -U sentence-transformers`

In [10]:
from sentence_transformers import SentenceTransformer, util
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)

# Texts
short_text = (
    "I'm looking for a white diesel van with automatic transmission, registered recently, "
    "less than 30,000 km driven, and EURO 6 emission class."
)

long_text = """
Information:
Category: Van / minibus, 5 door. Engine type: Diesel. Fuel type: Diesel. Emission class: EURO 6. 
CO2 emissions: 173 g/km (combined). Power output: 110 KW / 150 PS. First registration: 12.2023. 
KBA Key Manufacturer: 0603. KBA Key Type: CQJ. VIN: WV2ZZZST2RH******. Transmission: Automatic. 
Colour: white (Sonderlackierung Candy-weiss). Read mileage: 21,800 Kilometres. 
Owners: 1. Location: D-73. Vehicle release: 3 working days after payment.
"""

# Embeddings
short_emb = model.encode(short_text, convert_to_tensor=True)
long_emb = model.encode(long_text, convert_to_tensor=True)

# Cosine similarity
score = util.cos_sim(short_emb, long_emb).item()
print(f"Similarity score: {score:.4f}")

Similarity score: 0.7127
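
For reference, the cosine similarity reported above is the dot product of the two embeddings divided by the product of their lengths. The sketch below computes it in plain Python for small made-up vectors; `cosine_similarity` is a name chosen for this example, whereas `util.cos_sim` performs the equivalent computation on tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(f"{same_direction:.4f} {orthogonal:.4f} {opposite:.4f}")
```

Because only the direction of the vectors matters, scaling an embedding does not change the score.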


Test: Comparing the cosine similarity of three different queries against the same vehicle description.

In [11]:
from sentence_transformers import SentenceTransformer, util
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)

# Text
long_text = """
Information:
Category: Van / minibus, 5 door. Engine type: Diesel. Fuel type: Diesel. Emission class: EURO 6. 
CO2 emissions: 173 g/km (combined). Power output: 110 KW / 150 PS. First registration: 12.2023. 
KBA Key Manufacturer: 0603. KBA Key Type: CQJ. VIN: WV2ZZZST2RH******. Transmission: Automatic. 
Colour: white (Sonderlackierung Candy-weiss). Read mileage: 21,800 Kilometres. 
Owners: 1. Location: D-73. Vehicle release: 3 working days after payment.
"""

# 1. Matching query from before
short_text_1 = (
    "I'm looking for a white diesel van with automatic transmission, registered recently, "
    "less than 30,000 km driven, and EURO 6 emission class."
)

# 2. Close wording, but doesn't match key facts (petrol, manual, EURO 5, higher mileage)
short_text_2 = (
    "I'm searching for a white petrol van with manual transmission, EURO 5 emissions, "
    "and around 80,000 kilometers mileage."
)

# 3. Completely different car (sports car, different body, color, purpose, etc.)
short_text_3 = (
    "I'm interested in a red convertible sports car with over 300 horsepower and leather seats."
)

# Encode all texts
long_emb = model.encode(long_text, convert_to_tensor=True)
queries = [short_text_1, short_text_2, short_text_3]
query_embs = model.encode(queries, convert_to_tensor=True)

# Compare each short query to the long description
for i, emb in enumerate(query_embs):
    sim = util.cos_sim(emb, long_emb).item()
    print(f"Similarity score (short_text_{i+1}): {sim:.4f}")

Similarity score (short_text_1): 0.7127
Similarity score (short_text_2): 0.6525
Similarity score (short_text_3): 0.3093
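
One way to read these numbers is to rank the queries and look at the margins between them. The sketch below does this with the scores printed above, copied in by hand; the variable names are illustrative.

```python
# Scores copied from the output above
scores = {
    "short_text_1": 0.7127,
    "short_text_2": 0.6525,
    "short_text_3": 0.3093,
}

# Sort from most to least similar and report the gap to the next entry
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for (name, score), (_, next_score) in zip(ranked, ranked[1:]):
    print(f"{name}: {score:.4f} (margin to next: {score - next_score:.4f})")
print(f"{ranked[-1][0]}: {ranked[-1][1]:.4f}")
```

In this single test, the near-miss query (petrol, manual, EURO 5) lands only about 0.06 below the matching one, while the unrelated sports car query is far lower. The score appears to track topical closeness more strongly than agreement on individual attributes.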
