<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/nlp-for-semantic-search/2_sentence_transformers_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentence Transformers Basics

**Reference**:

https://www.pinecone.io/learn/sentence-embeddings/#other-sentence-transformers


##Setup

In [None]:
%%capture

!pip -q install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim, dot_score, semantic_search

import numpy as np
import pandas as pd

##Getting Started with Sentence Transformers

We will start with the original SBERT model `bert-base-nli-mean-tokens`.

First, we download and initialize the model.

In [None]:
model = SentenceTransformer("bert-base-nli-mean-tokens")

In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

The output here is the `SentenceTransformer` object, which contains two components:

- The `transformer` itself. Here we can see the maximum sequence length of 128 tokens and whether the input is lowercased (in this case, it is not). We can also see the model class, `BertModel`.
- The `pooling` operation. Here we can see that the model produces a 768-dimensional sentence embedding using mean pooling.
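
To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling. It is an illustration under assumptions, not the library's internal code: the function name `mean_pool`, the toy shapes, and the 0/1 attention mask are all hypothetical stand-ins for what the `Pooling` module receives from the transformer.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of each sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy input (hypothetical): 2 sentences, 3 token slots, 4 dimensions instead of 768.
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])  # the second sentence ends with one padding token

pooled = mean_pool(tokens, mask)
print(pooled.shape)
print(pooled)
```

Each sentence ends up as a single fixed-size vector regardless of how many tokens it contained, which is what lets us compare sentences of different lengths.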

Once we have the model, we can build sentence embeddings with a single call to the `encode` method.

In [None]:
sentences = [
   "the fifty mannequin heads floating in the pool kind of freaked them out",
    "she swore she just saw her sushi move",
    "he embraced his new life as an eggplant",
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "the dental specialist recommended an immediate stop to flossing with construction materials"
]

In [None]:
embeddings = model.encode(sentences)
embeddings.shape

(5, 768)

We can put together a quick semantic textual similarity (STS) example using nothing more than a cosine similarity function and NumPy.

In [None]:
sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
  sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])

sim

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.40914321, 1.        , 0.        , 0.        , 0.        ],
       [0.10909033, 0.44547975, 1.00000024, 0.        , 0.        ],
       [0.50074875, 0.30693948, 0.20791672, 1.00000024, 0.        ],
       [0.29936218, 0.38607219, 0.28499281, 0.63849497, 1.        ]])
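
The loop fills only the lower triangle of the matrix, since cosine similarity is symmetric. As a rough NumPy-only sketch of the same idea, the full matrix can be computed in one step by normalizing the rows and taking a matrix product. The helper name `cosine_matrix` and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # every row now has length 1
    return unit @ unit.T            # dot products of unit vectors are cosines

# Toy 2-dimensional "embeddings" (hypothetical values).
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])

full = cosine_matrix(toy)
lower = np.tril(full)               # keep the lower triangle, as in the loop
print(np.round(lower, 4))
```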

<img src='https://d33wubrfki0l68.cloudfront.net/0c2178d0eadcc9c5755526afda3322b0efbf0b87/8059a/images/sentence-embeddings-9.jpg?raw=1' width='600'/>

Here we have calculated the cosine similarity between every pair of our five sentence embeddings. The sentences, with their indices, are:

In [None]:
# Show the sentences with their row indices in a pandas DataFrame
df = pd.DataFrame({"Sentence": sentences})
df

| | Sentence |
|---|---|
| 0 | the fifty mannequin heads floating in the pool... |
| 1 | she swore she just saw her sushi move |
| 2 | he embraced his new life as an eggplant |
| 3 | my dentist tells me that chewing bricks is ver... |
| 4 | the dental specialist recommended an immediate... |


The highest similarity score between two different sentences is 0.64, in the bottom row of the matrix. As we would hope, this is for sentences 4 and 3, which both describe poor dental practices using construction materials.
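
Rather than reading the matrix by eye, the most similar pair can be located programmatically. The sketch below copies the SBERT matrix printed earlier and masks out the diagonal and the unused upper triangle before taking the maximum; the variable names are illustrative.

```python
import numpy as np

# Lower-triangular SBERT similarity matrix, copied from the output above.
sim = np.array([
    [1.0,        0.0,        0.0,        0.0,        0.0],
    [0.40914321, 1.0,        0.0,        0.0,        0.0],
    [0.10909033, 0.44547975, 1.00000024, 0.0,        0.0],
    [0.50074875, 0.30693948, 0.20791672, 1.00000024, 0.0],
    [0.29936218, 0.38607219, 0.28499281, 0.63849497, 1.0],
])

n = sim.shape[0]
below_diagonal = np.tri(n, k=-1, dtype=bool)        # True strictly below the diagonal
candidates = np.where(below_diagonal, sim, -np.inf)  # ignore self-similarity and unused cells

i, j = np.unravel_index(np.argmax(candidates), candidates.shape)
best_pair = (int(i), int(j))
best_score = float(sim[i, j])
print(best_pair, round(best_score, 2))
```

Masking with `-np.inf` instead of relying on the zeros keeps the lookup correct even when some similarities are negative.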

Let's find the similarity between these sentences and a new query sentence.

In [None]:
query_sentence = "Dentist advise me for brushing every morning."
query_embeddings = model.encode([query_sentence])
query_embeddings.shape

(1, 768)

In [None]:
match_scores = cos_sim(query_embeddings, embeddings)
match_scores

tensor([[0.1861, 0.4667, 0.4662, 0.3910, 0.5217]])

In [None]:
max(match_scores[0])

tensor(0.5217)

In [None]:
# get max index value
max_idx = np.argmax(match_scores[0])
match_scores[0][max_idx]

tensor(0.5217)

In [None]:
sentences[max_idx]

'the dental specialist recommended an immediate stop to flossing with construction materials'
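
Taking the `argmax` returns only the single best match. As a small sketch, ranking all of the scores printed above (copied here as a plain NumPy array) shows the full ordering. With this model, sentences 1 and 2 rank above sentence 3, even though sentence 3 is the other dentist-related sentence.

```python
import numpy as np

# Scores copied from the `match_scores` output above (SBERT, query vs. the five sentences).
scores = np.array([0.1861, 0.4667, 0.4662, 0.3910, 0.5217])

# argsort sorts ascending, so negate the scores to rank from best to worst.
ranking = np.argsort(-scores)

for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. sentence {idx} (score {scores[idx]:.4f})")
```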

##Other sentence-transformers

Since the original SBERT, many newer sentence transformer models have been released. These newer models can significantly outperform the original SBERT.

In [None]:
models = ["all-mpnet-base-v2", "all-roberta-large-v1", "all-MiniLM-L12-v1"]
performances = [63.30, 53.05, 59.80]
speed = [2800, 800, 7500]
sizes = [418, 1355, 118]
df = pd.DataFrame({"Model": models[:], "Avg. Performance": performances[:], "Speed": speed[:], "Size(MB)": sizes[:]})
df

| | Model | Avg. Performance | Speed | Size(MB) |
|---|---|---|---|---|
| 0 | all-mpnet-base-v2 | 63.3 | 2800 | 418 |
| 1 | all-roberta-large-v1 | 53.05 | 800 | 1355 |
| 2 | all-MiniLM-L12-v1 | 59.8 | 7500 | 118 |


For now, let's take one of the highest performers, `all-mpnet-base-v2`, and run it through our STS task.

In [None]:
mpnet = SentenceTransformer("all-mpnet-base-v2")

In [None]:
mpnet

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
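
Unlike the original SBERT, this model ends with a `Normalize()` module, which scales every embedding to unit length. One consequence is that the dot product of two embeddings equals their cosine similarity. The NumPy sketch below illustrates this with random toy vectors; `l2_normalize` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit (L2) length."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 8))      # toy "embeddings" before normalization
unit = l2_normalize(raw)

# Cosine similarity computed from the raw vectors.
raw_norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(raw_norms, raw_norms)

# Plain dot product of the normalized vectors.
dot = unit @ unit.T

print(np.allclose(dot, cosine))
```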

Let’s compare the STS results of `all-mpnet-base-v2` against SBERT.

In [None]:
embeddings = mpnet.encode(sentences)

sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
  sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])

sim

array([[ 1.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.26406282,  1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.16503504,  0.16126695,  1.        ,  0.        ,  0.        ],
       [ 0.0433446 ,  0.04615873,  0.0567013 ,  1.        ,  0.        ],
       [ 0.05398502,  0.06101182, -0.01122264,  0.51847196,  0.99999988]])

<img src='https://d33wubrfki0l68.cloudfront.net/988cdd0f93df1ffebf12f96621ffdabf861684b1/f8cb5/images/sentence-embeddings-10.jpg?raw=1' width='600'/>

The improved semantic representation of the later model is apparent. Although SBERT correctly identifies 4 and 3 as the most similar pair, it also assigns reasonably high similarity to other sentence pairs (up to 0.50).

On the other hand, the MPNet model makes a much clearer distinction between similar and dissimilar pairs: most pairs score less than 0.1, no dissimilar pair scores above 0.27, and the 4-3 pair scores 0.52.
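
One simple way to quantify this separation is the gap between the best pair and the runner-up pair. The sketch below uses the ten pairwise scores from the two matrices printed above; the "margin" measure itself is just an illustrative choice, not a standard benchmark metric.

```python
import numpy as np

# Off-diagonal (pairwise) similarities copied from the two matrices above.
pair_scores = {
    "SBERT": np.array([0.40914321, 0.10909033, 0.44547975, 0.50074875, 0.30693948,
                       0.20791672, 0.29936218, 0.38607219, 0.28499281, 0.63849497]),
    "MPNet": np.array([0.26406282, 0.16503504, 0.16126695, 0.0433446, 0.04615873,
                       0.0567013, 0.05398502, 0.06101182, -0.01122264, 0.51847196]),
}

margins = {}
for name, scores in pair_scores.items():
    ordered = np.sort(scores)[::-1]              # highest score first
    margins[name] = float(ordered[0] - ordered[1])
    print(f"{name}: best={ordered[0]:.2f}, runner-up={ordered[1]:.2f}, "
          f"margin={margins[name]:.2f}")
```

On these five sentences, the margin for MPNet is roughly twice that of SBERT, even though its top score is lower in absolute terms.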

In [None]:
query_sentence = "Dentist advise me for brushing every morning."
query_embeddings = mpnet.encode([query_sentence])
query_embeddings.shape

(1, 768)

In [None]:
match_scores = cos_sim(query_embeddings, embeddings)
print(match_scores)

# get max index value
max_idx = np.argmax(match_scores[0])
match_scores[0][max_idx]

tensor([[-0.1487, -0.0028, -0.0794,  0.3850,  0.4791]])


tensor(0.4791)

###RoBERTa

In [None]:
all_roberta = SentenceTransformer("all-roberta-large-v1")

In [None]:
all_roberta

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [None]:
embeddings = all_roberta.encode(sentences)

sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
  sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])

sim

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.23111157, 1.        , 0.        , 0.        , 0.        ],
       [0.23029339, 0.20125577, 1.00000012, 0.        , 0.        ],
       [0.01426927, 0.00184732, 0.03006939, 1.        , 0.        ],
       [0.09576994, 0.04095338, 0.06597304, 0.57448077, 0.99999994]])

In [None]:
query_sentence = "Dentist advise me for brushing every morning."
query_embeddings = all_roberta.encode([query_sentence])
query_embeddings.shape

(1, 1024)

In [None]:
match_scores = cos_sim(query_embeddings, embeddings)
print(match_scores)

# get max index value
max_idx = np.argmax(match_scores[0])
match_scores[0][max_idx]

tensor([[-0.1120, -0.0761, -0.0563,  0.3797,  0.4866]])


tensor(0.4866)

###QA-MPNet

In [None]:
qa_mpnet = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

In [None]:
qa_mpnet

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
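
Two things differ from the earlier models in this printout: `pooling_mode_cls_token` is `True` instead of mean pooling, and there is no `Normalize()` module. The sketch below illustrates both points with toy NumPy arrays. The helper names and values are hypothetical; CLS pooling is shown as taking the first token's vector, and the second half shows that without normalization the dot product depends on vector length while cosine similarity does not.

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Use the first token's vector as the sentence embedding."""
    return token_embeddings[:, 0, :]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy input: 2 sentences, 3 token slots, 4 dimensions.
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
pooled = cls_pool(tokens)
print(pooled)

# Unnormalized vectors: same direction, different lengths.
query = np.array([1.0, 0.0])
short = np.array([1.0, 1.0])
long = 3.0 * short

dot_short, dot_long = float(query @ short), float(query @ long)
cos_short, cos_long = cosine(query, short), cosine(query, long)
print(dot_short, dot_long)   # dot product grows with vector length
print(cos_short, cos_long)   # cosine is unchanged
```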

In [None]:
embeddings = qa_mpnet.encode(sentences)

sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
  sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])

sim

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.36049125, 1.        , 0.        , 0.        , 0.        ],
       [0.25880259, 0.32780212, 0.99999994, 0.        , 0.        ],
       [0.1674559 , 0.18139821, 0.19321227, 0.99999994, 0.        ],
       [0.21915776, 0.21057791, 0.17050785, 0.49836704, 1.        ]])

In [None]:
query_sentence = "Dentist advise me for brushing every morning."
query_embeddings = qa_mpnet.encode([query_sentence])
query_embeddings.shape

(1, 768)

In [None]:
match_scores = cos_sim(query_embeddings, embeddings)
print(match_scores)

# get max index value
max_idx = np.argmax(match_scores[0])
match_scores[0][max_idx]

tensor([[0.1042, 0.1929, 0.1514, 0.4884, 0.5005]])


tensor(0.5005)

###Multi QA Model

In [None]:
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its finacial district'])

print("Similarity:", dot_score(query_embedding, passage_embedding))
print("Similarity:", semantic_search(query_embedding, passage_embedding))

Similarity: tensor([[0.5472, 0.6330]])
Similarity: [[{'corpus_id': 1, 'score': 0.6329753398895264}, {'corpus_id': 0, 'score': 0.5472491979598999}]]


In [None]:
embeddings = model.encode(sentences)

In [None]:
query_sentence = "Dentist advise me for brushing every morning."
query_embeddings = model.encode([query_sentence])
query_embeddings.shape

(1, 384)

In [None]:
print("Similarity:", dot_score(query_embeddings, embeddings))
print("Similarity:", semantic_search(query_embeddings, embeddings))

Similarity: tensor([[-0.0323, -0.0472,  0.0586,  0.4577,  0.4823]])
Similarity: [[{'corpus_id': 4, 'score': 0.4823361039161682}, {'corpus_id': 3, 'score': 0.457660436630249}, {'corpus_id': 2, 'score': 0.058604899793863297}, {'corpus_id': 0, 'score': -0.032317135483026505}, {'corpus_id': 1, 'score': -0.04716726765036583}]]
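
The `semantic_search` output is a ranked list of `corpus_id` and `score` entries. The scores match the `dot_score` values here, which is consistent with this model producing unit-length embeddings. As a rough NumPy sketch of this kind of search (not the library's implementation), the hypothetical `top_k_search` helper below scores a toy corpus by cosine similarity and returns the top hits in the same shape.

```python
import numpy as np

def top_k_search(query: np.ndarray, corpus: np.ndarray, top_k: int = 3) -> list:
    """Rank corpus rows by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                                  # one score per corpus row
    order = np.argsort(-scores)[:top_k]             # best matches first
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

# Toy 2-dimensional corpus and query (hypothetical values).
corpus = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([2.0, 1.0])

hits = top_k_search(query, corpus, top_k=2)
print(hits)
```

For large corpora, a full sort like this becomes expensive, which is where chunked scoring or an approximate nearest-neighbor index would come in.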
