<a href="https://colab.research.google.com/github/shaypal5/general_stuff/blob/master/SentenceBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Basic Semantic Search Using Sentence Embeddings** \\
Semantic similarity using a fine-tuned, pre-trained BERT model.

**Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks** \\
[GitHub](https://github.com/UKPLab/sentence-transformers) \\
[Paper](https://arxiv.org/pdf/1908.10084.pdf) \\
Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort of finding the most similar pair in a collection of 10,000 sentences from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
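
To make the comparison concrete, here is a minimal sketch of cosine similarity between two embedding vectors, together with the pair count behind the speed-up. The toy vectors and the helper names `cosine_similarity` and `pair_count` are illustrative assumptions, not part of the library; the collection size of 10,000 sentences is the setting the paper uses for the 65-hour figure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_count(n):
    """Number of unordered sentence pairs among n sentences."""
    return n * (n - 1) // 2

# Toy 3-dimensional "embeddings" (real SBERT vectors have hundreds of dimensions).
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
pairs = pair_count(10_000)
print(sim_ab, sim_ac, pairs)
```

A cross-encoder such as plain BERT has to run once per pair (roughly 50 million forward passes for 10,000 sentences), whereas SBERT encodes each sentence once and then compares the resulting vectors with an inexpensive function like the one above.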

**Installation and Environment Setup**

In [None]:
!pip install -U sentence-transformers

Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.6/dist-packages (0.2.6.1)


**Sentence Embeddings with a Pretrained Model**

In [None]:
from sentence_transformers import SentenceTransformer


The models were first fine-tuned on the AllNLI dataset, then on the training set of the STS benchmark. They are particularly well suited to semantic textual similarity. For more details, see: sts-models.md.

**Pretrained Models** \\
bert-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.14 \\
bert-large-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.29 \\
roberta-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 85.44 \\
roberta-large-nli-stsb-mean-tokens: Performance: STSbenchmark: 86.39 \\
distilbert-base-nli-stsb-mean-tokens: Performance: STSbenchmark: 84.38 \\
**Source: "*Github Repository*"**
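
The `mean-tokens` suffix in these model names refers to the pooling step: the sentence embedding is the average of the token embeddings produced by the transformer, with padding positions ignored. Below is a minimal sketch of that averaging on made-up numbers. `mean_pool`, the token vectors, and the mask are hypothetical and only illustrate the idea; this is not the library's internal implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens, not padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / kept for total in totals]

# Three real tokens followed by one padding position (toy 2-dimensional vectors).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [3.0, 4.0]
```

Because the padding vector is masked out, sentences of different lengths all map to a fixed-size vector that depends only on their real tokens.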

In [None]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']
sentence_embeddings_base = model.encode(sentences)

In [None]:
# Print each sentence next to its embedding from the base model.
for sentence, embedding in zip(sentences, sentence_embeddings_base):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: Absence of sanity
Embedding: [-0.32751286  0.61889     0.79470986 ...  0.02687939 -0.13736466
  2.2358234 ]

Sentence: Lack of saneness
Embedding: [-0.13622051  0.41023704  0.73581094 ... -0.19232634 -0.12812541
  1.9241054 ]

Sentence: A man is eating food.
Embedding: [-0.20701389  0.6842224  -0.6645474  ... -0.16564319 -1.1139905
  0.93136346]

Sentence: A man is eating a piece of bread.
Embedding: [-0.5298074   1.0654029  -0.6272791  ... -0.31372124 -0.42181608
 -0.10556436]

Sentence: The girl is carrying a baby.
Embedding: [ 0.0217128   0.62021583 -0.9170204  ... -0.16280569  1.2698941
  1.9319594 ]

Sentence: A man is riding a horse.
Embedding: [-0.54099154 -0.17697512 -0.13467026 ... -1.097092   -0.5963547
  1.5534692 ]

Sentence: A woman is playing violin.
Embedding: [-0.11929066 -0.04466006 -1.2879481  ...  0.3768997  -0.9470354
  1.0744518 ]

Sentence: Two men pushed carts through the woods.
Embedding: [ 0.0423027   0.07625704 -0.09663814 ... -0.08406684 -0.32605657

Each sentence embedding from the base model is a 768-dimensional vector. \\
Let's try another pre-trained model, e.g. "roberta-large-nli-mean-tokens", and have a look at its embeddings.

---



In [None]:
model_roberta = SentenceTransformer('roberta-large-nli-mean-tokens')

In [None]:
sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']
sentence_embeddings = model_roberta.encode(sentences)

In [None]:
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: Absence of sanity
Embedding: [-0.32751286  0.61889     0.79470986 ...  0.02687939 -0.13736466
  2.2358234 ]

Sentence: Lack of saneness
Embedding: [-0.13622051  0.41023704  0.73581094 ... -0.19232634 -0.12812541
  1.9241054 ]

Sentence: A man is eating food.
Embedding: [-0.20701389  0.6842224  -0.6645474  ... -0.16564319 -1.1139905
  0.93136346]

Sentence: A man is eating a piece of bread.
Embedding: [-0.5298074   1.0654029  -0.6272791  ... -0.31372124 -0.42181608
 -0.10556436]

Sentence: The girl is carrying a baby.
Embedding: [ 0.0217128   0.62021583 -0.9170204  ... -0.16280569  1.2698941
  1.9319594 ]

Sentence: A man is riding a horse.
Embedding: [-0.54099154 -0.17697512 -0.13467026 ... -1.097092   -0.5963547
  1.5534692 ]

Sentence: A woman is playing violin.
Embedding: [-0.11929066 -0.04466006 -1.2879481  ...  0.3768997  -0.9470354
  1.0744518 ]

Sentence: Two men pushed carts through the woods.
Embedding: [ 0.0423027   0.07625704 -0.09663814 ... -0.08406684 -0.32605657

**Semantic Search Using Sentence-BERT**

In [None]:
query = 'Nobody has sane thoughts'  # Query sentence to match against the corpus.
queries = [query]
query_embeddings = model.encode(queries)

In [None]:
!pip install scipy
import scipy.spatial.distance  # import the submodule explicitly so cdist is available



In [None]:
number_top_matches = 5  # assumed value; matches the "Top 5" header printed below

print("Semantic Search Results")

for query, query_embedding in zip(queries, query_embeddings):
    # Cosine distance (1 - cosine similarity) from the query to every corpus sentence.
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings_base, "cosine")[0]

    # Pair each corpus index with its distance and sort, closest first.
    results = sorted(enumerate(distances), key=lambda x: x[1])
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1 - distance))

Semantic Search Results
Query: Nobody has sane thoughts

Top 5 most similar sentences in corpus:
Lack of saneness (Cosine Score: 0.8958)
Absence of sanity (Cosine Score: 0.8744)
A man is riding a horse. (Cosine Score: 0.1705)
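
The ranking step above boils down to three operations: compute the cosine distance from the query to every corpus vector, sort in ascending order, and keep the first k. A dependency-free sketch of that logic on toy vectors follows. `top_k_matches`, `cosine_distance`, and the two-dimensional vectors are illustrative assumptions; the notebook itself relies on `scipy.spatial.distance.cdist` for the distance computation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def top_k_matches(query_vec, corpus_vecs, k):
    """Return (index, similarity) for the k corpus vectors closest to the query."""
    distances = [(idx, cosine_distance(query_vec, vec))
                 for idx, vec in enumerate(corpus_vecs)]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return [(idx, 1.0 - dist) for idx, dist in distances[:k]]

# Toy corpus: each label stands in for a sentence, each vector for its embedding.
corpus = ["north", "north-east", "east", "south"]
corpus_vecs = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
query_vec = [0.1, 1.0]  # points almost due north

matches = top_k_matches(query_vec, corpus_vecs, k=2)
for idx, score in matches:
    print(corpus[idx], "(Cosine Score: %.4f)" % score)
```

Sorting every distance is fine for a corpus of this size; for much larger corpora a partial selection of the k smallest values avoids the full sort.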
