#### Sentence Embeddings using Siamese BERT-Networks  
https://www.sbert.net/

**Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks** \\
[GitHub](https://github.com/UKPLab/sentence-transformers) \\
[Paper](https://arxiv.org/pdf/1908.10084.pdf) \\
Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort of finding the most similar pair in a collection of 10,000 sentences from about 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
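
The speed-up follows from how the comparisons are counted. A cross-encoder such as BERT needs one forward pass per sentence pair, whereas SBERT encodes each sentence once and then compares the resulting vectors. A rough sketch of that arithmetic, assuming the 10,000-sentence collection used for the timing in the SBERT paper (the variable names are illustrative only):

```python
# Count the model forward passes needed to find the most similar pair
# among n sentences.
n_sentences = 10_000  # assumption: collection size quoted in the SBERT paper

# Cross-encoder (BERT / RoBERTa): one forward pass per unordered pair.
cross_encoder_passes = n_sentences * (n_sentences - 1) // 2

# Bi-encoder (SBERT): one forward pass per sentence; the pairwise
# comparison afterwards is cheap vector arithmetic.
bi_encoder_passes = n_sentences

print(f"cross-encoder forward passes: {cross_encoder_passes:,}")
print(f"bi-encoder forward passes:    {bi_encoder_passes:,}")
```

The number of vector comparisons is still quadratic in the bi-encoder case, but each comparison is a dot product rather than a full transformer pass.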

**Sentence Embeddings with a Pretrained Model**

In [1]:
# !pip install --upgrade sentence-transformers

In [2]:
import scipy
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

pd.set_option('max_colwidth', 500)

**Pretrained Models**: https://huggingface.co/sentence-transformers?sort_models=downloads#models

The `all-mpnet-base-v2` model provides the best quality, while `all-MiniLM-L6-v2` is 5 times faster and still offers good quality.

In [3]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

### Sentence Similarity

In [4]:
sentences = ['Absence of sanity', 
             'Lack of saneness',
             'A man is eating food',
             'A man is eating a piece of bread',
             'The girl is carrying a baby',
             'A man is riding a horse',
             'A woman is playing violin',
             'Two men pushed carts through the woods',
             'A man is riding a white horse on an enclosed ground',
             'A monkey is playing drums',
             'A cheetah is running behind its prey']

#Encode all sentences
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(sentence_embeddings, sentence_embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

In [5]:
sentence_1 = []
sentence_2 = []
cosine_similarity = []

for score, i, j in all_sentence_combinations:
    sentence_1.append(sentences[i])
    sentence_2.append(sentences[j])
    cosine_similarity.append(cos_sim[i][j].item())
    
df = pd.DataFrame(zip(sentence_1, sentence_2, cosine_similarity), columns=['Sentence_1', 'Sentence_2', 'Similarity_Score'])
df.sort_values(by=['Similarity_Score'], ascending=False).head(10)

| | Sentence_1 | Sentence_2 | Similarity_Score |
|---|---|---|---|
| 0 | Absence of sanity | Lack of saneness | 0.743339 |
| 19 | A man is eating food | A man is eating a piece of bread | 0.737853 |
| 42 | A man is riding a horse | A man is riding a white horse on an enclosed ground | 0.72603 |
| 21 | A man is eating food | A man is riding a horse | 0.244987 |
| 47 | A woman is playing violin | A monkey is playing drums | 0.195667 |
| 24 | A man is eating food | A man is riding a white horse on an enclosed ground | 0.166425 |
| 28 | A man is eating a piece of bread | A man is riding a horse | 0.13834 |
| 26 | A man is eating food | A cheetah is running behind its prey | 0.136164 |
| 31 | A man is eating a piece of bread | A man is riding a white horse on an enclosed ground | 0.112944 |
| 54 | A monkey is playing drums | A cheetah is running behind its prey | 0.111993 |
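
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below repeats the pairwise comparison on three made-up 3-dimensional vectors in plain Python, using `itertools.combinations` to visit each unordered pair once. The toy vectors and the `cosine` helper are illustrative and not part of sentence-transformers.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; real sentence embeddings have
# hundreds of dimensions.
toy_embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "c": [0.0, 0.0, 2.0],
}

# combinations() yields each unordered pair once, like an i < j double loop.
pair_scores = [
    (cosine(toy_embeddings[x], toy_embeddings[y]), x, y)
    for x, y in combinations(toy_embeddings, 2)
]

# Highest similarity first.
pair_scores.sort(reverse=True)
for score, x, y in pair_scores:
    print(f"{x}-{y}: {score:.4f}")
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and the vector lengths do not affect the result.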


### Query Search

In [6]:
# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


query_list = []
sentence_list = []
cosine_similarity = []

# Find the top_k (here 3) closest corpus sentences for each query sentence based on cosine similarity
top_k = min(3, len(sentences))
for query in queries:
    query_embeddings = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest N scores
    cos_scores = util.cos_sim(query_embeddings, sentence_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    for score, idx in zip(top_results[0], top_results[1]):
        # print(sentences[idx], "(Score: {:.4f})".format(score))
    
        query_list.append(query)
        sentence_list.append(sentences[idx])
        cosine_similarity.append(score.item())
    
df = pd.DataFrame(zip(query_list, sentence_list, cosine_similarity), columns=['Query', 'Sentence', 'Similarity_Score'])
df.sort_values(by=['Query', 'Similarity_Score'], ascending=[True, False]).head(10)

| | Query | Sentence | Similarity_Score |
|---|---|---|---|
| 6 | A cheetah chases prey on across a field. | A cheetah is running behind its prey | 0.81925 |
| 7 | A cheetah chases prey on across a field. | A man is eating food | 0.131582 |
| 8 | A cheetah chases prey on across a field. | A monkey is playing drums | 0.119997 |
| 0 | A man is eating pasta. | A man is eating food | 0.703525 |
| 1 | A man is eating pasta. | A man is eating a piece of bread | 0.516403 |
| 2 | A man is eating pasta. | A man is riding a horse | 0.188286 |
| 3 | Someone in a gorilla costume is playing a set of drums. | A monkey is playing drums | 0.642459 |
| 4 | Someone in a gorilla costume is playing a set of drums. | A woman is playing violin | 0.252789 |
| 5 | Someone in a gorilla costume is playing a set of drums. | A man is riding a horse | 0.132145 |
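
The retrieval step amounts to scoring the query against every corpus vector and keeping the `k` best. A minimal plain-Python sketch of that idea on made-up 2-dimensional vectors follows. `top_k_matches`, `cosine`, and the toy data are hypothetical names for illustration, standing in for the `util.cos_sim` plus `torch.topk` combination used in the cell above.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_matches(query_vec, corpus_vecs, k):
    """Return the k (score, corpus_index) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    # nlargest returns the results already sorted from best to worst.
    return heapq.nlargest(min(k, len(scored)), scored)

# Hypothetical toy corpus and query vectors.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

matches = top_k_matches(toy_query, toy_corpus, k=2)
for score, idx in matches:
    print(idx, round(score, 4))
```

Capping `k` at the corpus size mirrors the `min(3, len(sentences))` guard in the notebook cell, which avoids asking for more results than there are corpus entries.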


In [7]:
# model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

### Sentence Similarity

In [8]:
sentences = ['Absence of sanity', 
             'Lack of saneness',
             'A man is eating food',
             'A man is eating a piece of bread',
             'The girl is carrying a baby',
             'A man is riding a horse',
             'A woman is playing violin',
             'Two men pushed carts through the woods',
             'A man is riding a white horse on an enclosed ground',
             'A monkey is playing drums',
             'A cheetah is running behind its prey']

#Encode all sentences
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(sentence_embeddings, sentence_embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

In [9]:
sentence_1 = []
sentence_2 = []
cosine_similarity = []

for score, i, j in all_sentence_combinations:
    sentence_1.append(sentences[i])
    sentence_2.append(sentences[j])
    cosine_similarity.append(cos_sim[i][j].item())
    
df = pd.DataFrame(zip(sentence_1, sentence_2, cosine_similarity), columns=['Sentence_1', 'Sentence_2', 'Similarity_Score'])
df.sort_values(by=['Similarity_Score'], ascending=False).head(10)

| | Sentence_1 | Sentence_2 | Similarity_Score |
|---|---|---|---|
| 0 | Absence of sanity | Lack of saneness | 0.826907 |
| 42 | A man is riding a horse | A man is riding a white horse on an enclosed ground | 0.696171 |
| 19 | A man is eating food | A man is eating a piece of bread | 0.673686 |
| 21 | A man is eating food | A man is riding a horse | 0.243545 |
| 28 | A man is eating a piece of bread | A man is riding a horse | 0.218326 |
| 54 | A monkey is playing drums | A cheetah is running behind its prey | 0.162626 |
| 24 | A man is eating food | A man is riding a white horse on an enclosed ground | 0.157566 |
| 47 | A woman is playing violin | A monkey is playing drums | 0.156521 |
| 31 | A man is eating a piece of bread | A man is riding a white horse on an enclosed ground | 0.14504 |
| 44 | A man is riding a horse | A cheetah is running behind its prey | 0.077782 |


### Query Search

In [10]:
# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


query_list = []
sentence_list = []
cosine_similarity = []

# Find the top_k (here 3) closest corpus sentences for each query sentence based on cosine similarity
top_k = min(3, len(sentences))
for query in queries:
    query_embeddings = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest N scores
    cos_scores = util.cos_sim(query_embeddings, sentence_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    for score, idx in zip(top_results[0], top_results[1]):
        # print(sentences[idx], "(Score: {:.4f})".format(score))
    
        query_list.append(query)
        sentence_list.append(sentences[idx])
        cosine_similarity.append(score.item())
    
df = pd.DataFrame(zip(query_list, sentence_list, cosine_similarity), columns=['Query', 'Sentence', 'Similarity_Score'])
df.sort_values(by=['Query', 'Similarity_Score'], ascending=[True, False]).head(10)

| | Query | Sentence | Similarity_Score |
|---|---|---|---|
| 6 | A cheetah chases prey on across a field. | A cheetah is running behind its prey | 0.803264 |
| 7 | A cheetah chases prey on across a field. | A man is riding a white horse on an enclosed ground | 0.15082 |
| 8 | A cheetah chases prey on across a field. | A monkey is playing drums | 0.144822 |
| 0 | A man is eating pasta. | A man is eating food | 0.595814 |
| 1 | A man is eating pasta. | A man is eating a piece of bread | 0.419617 |
| 2 | A man is eating pasta. | A man is riding a horse | 0.192166 |
| 3 | Someone in a gorilla costume is playing a set of drums. | A monkey is playing drums | 0.653894 |
| 4 | Someone in a gorilla costume is playing a set of drums. | A cheetah is running behind its prey | 0.177457 |
| 5 | Someone in a gorilla costume is playing a set of drums. | A woman is playing violin | 0.117351 |
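
To compare the two runs side by side, the top-1 scores reported in the two Query Search outputs can be tabulated directly. The snippet below copies those numbers by hand and does not re-run either model, so it only illustrates the comparison. Three queries are far too few to rank the models.

```python
# Top-1 similarity per query, copied from the two Query Search outputs above.
top1_scores = {
    "A man is eating pasta.": {
        "all-MiniLM-L6-v2": 0.703525,
        "all-mpnet-base-v2": 0.595814,
    },
    "Someone in a gorilla costume is playing a set of drums.": {
        "all-MiniLM-L6-v2": 0.642459,
        "all-mpnet-base-v2": 0.653894,
    },
    "A cheetah chases prey on across a field.": {
        "all-MiniLM-L6-v2": 0.81925,
        "all-mpnet-base-v2": 0.803264,
    },
}

# Positive delta: all-mpnet-base-v2 scored the best match higher.
deltas = {
    query: scores["all-mpnet-base-v2"] - scores["all-MiniLM-L6-v2"]
    for query, scores in top1_scores.items()
}

for query, delta in deltas.items():
    print(f"{delta:+.6f}  {query}")
```

Both models return the same top match for every query. The absolute scores are not directly comparable across models, so the ranking within each model matters more than the raw difference.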


#### Fine-tuning Sentence BERT
https://huggingface.co/blog/how-to-train-sentence-transformers

In [11]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Wed, 09 November 2022 23:33:57'