# CE5.2: SentenceBERT and Co. - School Solution

This notebook will demonstrate basic *semantic search* using sentence embeddings pre-trained for this purpose.

**Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks**

[The SentenceTransformers Python package home](https://sbert.net/)

[The SentenceTransformers GitHub](https://github.com/UKPLab/sentence-transformers)

[The SentenceBERT paper](https://arxiv.org/pdf/1908.10084.pdf)


The models we'll examine were first fine-tuned on the AllNLI dataset, then on the training set of the STS benchmark, which makes them particularly well suited to semantic textual similarity. For more details, see: sts-models.md.

**Pretrained Models:**

bert-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.14

bert-large-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.29

roberta-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.44 

roberta-large-nli-stsb-mean-tokens: Performance: STSbenchmark: 86.39 

distilbert-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 84.38 
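
The `mean-tokens` suffix in the model names above refers to mean pooling: the sentence embedding is the average of the token embeddings, with padding positions ignored. The following is a minimal NumPy sketch of that idea on toy data; the function name `mean_pool` and the toy arrays are hypothetical and are not taken from the SentenceTransformers code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the embedding dim.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Toy input (hypothetical): 1 sentence, 3 token slots, 2-dim vectors.
# The last slot is padding and must not influence the result.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [[2. 3.]]
```

In a real model the token embeddings would have 768 dimensions (for the base models), so the pooled sentence vector has 768 values as well.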

## 1. Installation & Imports

In [1]:
# !pip install -U sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer

## 2. A Toy Dataset

In [7]:
sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']

## 3. Loading the model

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [4]:
import torch
torch.backends.mps.is_built()  # torch.has_mps is deprecated in favor of this call


True

In [5]:
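# Note: this loads the NLI-only checkpoint; the STSb-tuned variant listed above is 'bert-base-nli-stsb-mean-tokens'.
model = SentenceTransformer('bert-base-nli-mean-tokens')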


## 4. Embedding our datasets

In [19]:
sentence_embeddings_base = model.encode(sentences)

In [15]:
n = 3

for sentence, embedding in zip(sentences[:n], sentence_embeddings_base[:n]):
    print("Sentence:", sentence)
    print(f"Embedding shape: {embedding.shape}")
    print(f"Embedding: {embedding[:20]} ...")
    print("")

Sentence: Absence of sanity
Embedding shape: (768,)
Embedding: [ 0.29540247  0.29181156  2.1648014   0.22041972 -0.01308651  1.0195036
  1.5129826   0.23413222  0.27305812  0.13512346 -1.1131338  -0.12588474
  0.14537813  0.97770846  1.3935229   0.4577056  -0.582132   -0.7249414
 -0.3617342  -0.22751491] ...

Sentence: Lack of saneness
Embedding shape: (768,)
Embedding: [ 0.3043082   0.18374072  1.773273    0.32850876 -0.14961638  1.0655503
  1.5567325   0.30895364  0.2585117  -0.02292434 -1.2191778  -0.11834075
  0.09931859  0.8053728   1.1849424   0.44961277 -0.21068613 -0.8513134
 -0.32015172 -0.20306925] ...

Sentence: A man is eating food.
Embedding shape: (768,)
Embedding: [ 0.16618593  0.12440389  1.2497073  -0.53838134 -0.31307387  0.7524601
 -1.2488308   0.68713653 -0.6588014  -0.794196   -0.1289607   0.88119733
 -0.22051185  0.24356195  0.8588921  -0.3248815   0.07663598 -0.9614749
  0.43711475 -0.25115418] ...



Each sentence embedding is a one-dimensional vector of 768 values, i.e. it has shape `(768,)`.

## 5. Semantic Search Using Sentence-BERT

In [17]:
query = 'Nobody has sane thoughts'  # The query sentence used for the semantic similarity search.
queries = [query]
query_embeddings = model.encode(queries)

In [18]:
# !pip install scipy
import scipy.spatial.distance  # explicitly load the submodule that provides cdist


In [21]:
number_top_matches = 5

In [24]:
DATABASE_EMBEDDINGS = sentence_embeddings_base

for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], DATABASE_EMBEDDINGS, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    print("Query:", query)
    print(f"\nTop {number_top_matches} most similar sentences in corpus:")
    print("---------------------------------------")

    for idx, distance in results[0:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1-distance))

Query: Nobody has sane thoughts

Top 5 most similar sentences in corpus:
---------------------------------------
Lack of saneness (Cosine Score: 0.8958)
Absence of sanity (Cosine Score: 0.8744)
A man is riding a horse. (Cosine Score: 0.1705)
A monkey is playing drums. (Cosine Score: 0.1687)
The girl is carrying a baby. (Cosine Score: 0.1521)
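
The search cell above relies on the fact that `cdist(..., "cosine")` returns the cosine *distance*, which is one minus the cosine similarity; that is why it prints `1 - distance`. As a rough sketch of the same ranking done directly with cosine similarity in NumPy, the following uses a hypothetical helper `top_k_cosine` and made-up 2-D toy vectors in place of real embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k):
    """Return (index, cosine similarity) pairs for the k best matches."""
    # Normalize to unit length so the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q                     # one similarity per corpus row
    order = np.argsort(-sims)[:k]    # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors (hypothetical) standing in for sentence embeddings.
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.0])

hits = top_k_cosine(toy_query, toy_corpus, k=2)
for idx, score in hits:
    print(idx, f"{score:.4f}")
```

With real data, `toy_query` and `toy_corpus` would be replaced by one row of `query_embeddings` and by `sentence_embeddings_base`.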


## 6. Try another model

In [23]:
qa_model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")


In [27]:
sentence_embeddings_qa = qa_model.encode(sentences)

In [42]:
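# Caution: query_embeddings still comes from the BERT model in section 5, not qa_model, which likely explains the near-zero scores below.
list(qa_model.similarity(query_embeddings[0], sentence_embeddings_qa).numpy()[0])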

[-0.06611297,
 -0.031156905,
 -0.017642125,
 -0.055026677,
 -0.045368895,
 -0.0358141,
 -0.039724793,
 0.029961023,
 -0.013393985,
 -0.043789636,
 -0.0762686]

In [55]:
query = 'Nobody has sane thoughts'  # The query sentence used for the semantic similarity search.
queries = [query]
query_embeddings = qa_model.encode(queries)

In [67]:
def print_similar_sentences(query, model, sentence_embeddings, top_n):
    """Print the top_n entries of the global `sentences` list most similar to `query`."""
    query_embedding = model.encode(query)
    similarities = model.similarity(query_embedding, sentence_embeddings).numpy()[0]
    # Note: abs() discards the sign, so a strongly negative score would rank as a strong match.
    similarities = [abs(float(x)) for x in similarities]
    # Pair each corpus index with its score and sort from most to least similar.
    results = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
    print("Query:", query)
    print(f"\nTop {top_n} most similar sentences in corpus:")
    print("---------------------------------------")
    for idx, similarity in results[:top_n]:
        print(f"{sentences[idx].strip()}, (similarity: {similarity:.4f})")

In [68]:
print_similar_sentences('Nobody has sane thoughts', qa_model, sentence_embeddings_qa, 10)

Query: Nobody has sane thoughts

Top 10 most similar sentences in corpus:
---------------------------------------
Lack of saneness, (similarity: 0.6855)
Absence of sanity, (similarity: 0.5995)
A man is riding a horse., (similarity: 0.1522)
A man is riding a white horse on an enclosed ground., (similarity: 0.1343)
A man is eating a piece of bread., (similarity: 0.1219)
A monkey is playing drums., (similarity: 0.1210)
A man is eating food., (similarity: 0.1071)
Two men pushed carts through the woods., (similarity: 0.0932)
A woman is playing violin., (similarity: 0.0927)
The girl is carrying a baby., (similarity: 0.0710)


In [69]:
print_similar_sentences('A man is riding a cheetah', qa_model, sentence_embeddings_qa, 10)

Query: A man is riding a cheetah

Top 10 most similar sentences in corpus:
---------------------------------------
A cheetah is running behind its prey., (similarity: 0.7349)
A man is riding a horse., (similarity: 0.6730)
A man is riding a white horse on an enclosed ground., (similarity: 0.5725)
A monkey is playing drums., (similarity: 0.4466)
A man is eating food., (similarity: 0.4344)
A man is eating a piece of bread., (similarity: 0.4189)
Two men pushed carts through the woods., (similarity: 0.4040)
A woman is playing violin., (similarity: 0.3539)
The girl is carrying a baby., (similarity: 0.3367)
Lack of saneness, (similarity: 0.0715)
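
Because `print_similar_sentences` ranks by the absolute value of each score, a sentence with a strongly negative cosine similarity would be listed as a strong match. The small sketch below contrasts ranking by absolute value with ranking by the signed score; the helper `rank` and the three toy scores are invented for illustration and are not outputs of the models above.

```python
def rank(similarities, use_abs):
    """Return corpus indices ordered from highest to lowest score."""
    if use_abs:
        key = lambda pair: abs(pair[1])   # mirrors the abs() step in the notebook function
    else:
        key = lambda pair: pair[1]        # plain signed cosine similarity
    return [idx for idx, _ in sorted(enumerate(similarities), key=key, reverse=True)]

# Toy scores (hypothetical): index 1 is strongly *dissimilar* to the query.
toy_scores = [0.10, -0.60, 0.40]

ranked_abs = rank(toy_scores, use_abs=True)
ranked_signed = rank(toy_scores, use_abs=False)
print("abs ranking:   ", ranked_abs)     # [1, 2, 0]
print("signed ranking:", ranked_signed)  # [2, 0, 1]
```

For semantic search, the signed ranking is usually the one that matches the intent of "most similar", so dropping the `abs` step may be worth considering if negative scores appear.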
