<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/sentence-transformer-works/03_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semantic Search

**Reference**:

[Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)

In [None]:
!pip install sentence-transformers

In [3]:
from sentence_transformers import SentenceTransformer, util

import torch

## Search

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [8]:
# Corpus with example sentences
corpus = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]

# Compute embeddings for the corpus sentences
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query sentences
queries = [
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.'
]

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
  query_embedding = model.encode(query, convert_to_tensor=True)
  # We use cosine-similarity and torch.topk to find the highest 5 scores
  cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  top_results = torch.topk(cosine_scores, k=top_k)

  print("\n\n======================\n\n")
  print("Query:", query)
  print("\nTop {} most similar sentences in corpus:".format(top_k))

  # Output the pairs with their score
  for score, idx in zip(top_results[0], top_results[1]):
    #print(score, idx)
    print(corpus[idx], "(Score: {:.4f})".format(score))





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a horse. (Score: 0.0650)
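
To make the two steps inside the loop concrete, here is a minimal, dependency-free sketch of what cosine similarity followed by top-k selection computes. The toy 3-dimensional vectors and the helper names `cosine` and `top_k_indices` are illustrative assumptions, not part of sentence-transformers; in the notebook the vectors come from `model.encode(...)` and the same work is done by `util.cos_sim` and `torch.topk`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical toy "embeddings" standing in for model.encode(...) output
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

# Step 1: score the query against every corpus vector
scores = [cosine(toy_query, doc) for doc in toy_corpus]

# Step 2: keep the k best-scoring corpus entries
best = top_k_indices(scores, k=2)
for i in best:
    print(i, "(Score: {:.4f})".format(scores[i]))
```

Because cosine similarity divides by both vector norms, only the direction of each embedding matters, not its length.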

Alternatively, we can use `util.semantic_search` to perform the cosine-similarity and top-k steps in a single call.

In [9]:
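# query_embedding still holds the last query from the loop above
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)

# semantic_search returns one list of hits per query; we passed a single query, so take the first list
hits = hits[0]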
for hit in hits:
  print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a horse. (Score: 0.0650)
