<a href="https://colab.research.google.com/github/vishnoitanuj/nlp_semantic_search/blob/main/NLP_for_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentence Transformers

In [None]:
!pip install sentence-transformers

In [2]:
from sentence_transformers import SentenceTransformer as st

In [None]:
model = st('all-mpnet-base-v2')

In [4]:
sentences = [
    "it caught him off guard that space smelled of seared steak",
    "she could not decide between painting her teeth or brushing her nails",
    "he thought there'd be sufficient time is he hid his watch",
    "the bees decided to have a mutiny against their queen",
    "the sign said there was road work ahead so she decided to speed up",
    "on a scale of one to ten, what's your favorite flavor of color?",
    "flying stinging insects rebelled in opposition to the matriarch"
]

In [5]:
embeddings = model.encode(sentences)

In [6]:
embeddings.shape

(7, 768)

In [11]:
from sentence_transformers.util import cos_sim

In [8]:
scores = cos_sim(embeddings[-1], embeddings[:-1])
scores

tensor([[ 0.1232,  0.1967,  0.0523,  0.6084,  0.1011, -0.0492]])

In [9]:
sentences[scores.argmax().item()]

'the bees decided to have a mutiny against their queen'
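To make the score above less of a black box, here is a minimal sketch of cosine similarity in plain Python. It is not the `sentence_transformers` implementation; the helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for illustration, standing in for the 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (made up): a query and three candidates.
toy_query = [1.0, 2.0, 0.0]
toy_candidates = [
    [2.0, 4.0, 0.0],    # same direction as the query
    [0.0, 0.0, 3.0],    # orthogonal to the query
    [-1.0, -2.0, 0.0],  # opposite direction
]

toy_scores = [cosine_similarity(toy_query, c) for c in toy_candidates]
# Index of the highest-scoring candidate, mirroring scores.argmax() above.
toy_best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, toy_best)
```

Because the score depends only on direction, scaling a vector does not change it, which is why the first toy candidate scores highest even though it is twice as long as the query.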

## Question Answering using DPR (Facebook AI's Dense Passage Retriever)

In [None]:
!pip install transformers

In [3]:
from transformers import DPRContextEncoderTokenizer, DPRContextEncoder, \
    DPRQuestionEncoderTokenizer, DPRQuestionEncoder

In [4]:
ctx_model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

question_model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


In [5]:
questions = [
    "what is the capital city of australia?",
    "what is the best selling sci-fi book?",
    "how many searches are performed on Google?"
]

contexts = [
    "canberra is the capital city of australia",
    "what is the capital city of australia?",
    "the capital city of france is paris",
    "what is the best selling sci-fi book?",
    "sc-fi is a popular book genre read by millions",
    "the best-selling sci-fi book is dune",
    "how many searches are performed on Google?",
    "Google serves more than 2 trillion queries annually",
    "Google is a popular search engine"
]

In [6]:
# Tokenize the contexts; return_tensors='pt' returns PyTorch tensors
xb_tokens = ctx_tokenizer(contexts, max_length=256, padding='max_length', truncation=True, return_tensors='pt')
xb = ctx_model(**xb_tokens)

# Tokenize the questions the same way, using the question encoder's tokenizer
xq_tokens = question_tokenizer(questions, max_length=256, padding='max_length', truncation=True, return_tensors='pt')
xq = question_model(**xq_tokens)

In [7]:
xb.keys()

odict_keys(['pooler_output'])

In [8]:
xb.pooler_output.shape, xq.pooler_output.shape

(torch.Size([9, 768]), torch.Size([3, 768]))

In [None]:
!pip install torch

In [13]:
import torch

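for i, xq_vec in enumerate(xq.pooler_output):
    # Cosine similarity between this question and every context
    # (these are similarity scores, not probabilities)
    sims = cos_sim(xq_vec, xb.pooler_output)
    best = torch.argmax(sims).item()  # index of the closest context
    print(questions[i])
    print(contexts[best])
    print("-----")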

what is the capital city of australia?
canberra is the capital city of australia
-----
what is the best selling sci-fi book?
the best-selling sci-fi book is dune
-----
how many searches are performed on Google?
how many searches are performed on Google?
-----
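In the last result above, the question was matched to its own verbatim copy in `contexts`. One possible way to handle this, sketched below under stated assumptions, is to rank all contexts by score and skip any whose text equals the question. The helper `rank_contexts`, its `top_k` parameter, and the toy scores are hypothetical and made up for illustration; they are not part of DPR or `transformers`.

```python
def rank_contexts(question, contexts, scores, top_k=3):
    """Return up to top_k (context, score) pairs, best first.

    Hypothetical helper: contexts whose text matches the question exactly
    (ignoring case and surrounding whitespace) are skipped.
    """
    if len(contexts) != len(scores):
        raise ValueError("contexts and scores must have the same length")
    normalized_question = question.strip().lower()
    candidates = [
        (ctx, float(score))
        for ctx, score in zip(contexts, scores)
        if ctx.strip().lower() != normalized_question
    ]
    # Sort by score, highest first; Python's sort is stable, so ties keep input order.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]


# Toy data: the scores are invented, not real model output.
toy_question = "how many searches are performed on Google?"
toy_contexts = [
    "how many searches are performed on Google?",
    "Google serves more than 2 trillion queries annually",
    "Google is a popular search engine",
]
toy_context_scores = [0.95, 0.80, 0.60]

for ctx, score in rank_contexts(toy_question, toy_contexts, toy_context_scores, top_k=2):
    print(f"{score:.2f}  {ctx}")
```

With the notebook's tensors, the scores for question `i` could be passed as `cos_sim(xq_vec, xb.pooler_output).detach()[0].tolist()`, assuming `cos_sim` returns a tensor of shape `(1, len(contexts))` as it does in the loop above.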
