<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/nlp-for-semantic-search/1_dense_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Dense Vectors

Reference:

https://www.pinecone.io/learn/dense-vector-embeddings-nlp/

https://www.youtube.com/watch?v=bVZJ_O_-0RE

##Setup

In [None]:
!pip -q install transformers
!pip -q install sentence-transformers

In [20]:
from sentence_transformers import SentenceTransformer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer, DPRQuestionEncoder, DPRQuestionEncoderTokenizer

import torch

##Sentence Similarity

Let’s look at how we can quickly pull together some sentence embeddings using the sentence-transformers library.

In [None]:
model = SentenceTransformer("all-mpnet-base-v2")

Then we can encode a few sentences, some closer in meaning than others, while sharing very few matching words.

In [4]:
sentences = [
    "it caught him off guard that space smelled of seared steak",
    "she could not decide between painting her teeth or brushing her nails",
    "he thought there'd be sufficient time is he hid his watch",
    "the bees decided to have a mutiny against their queen",
    "the sign said there was road work ahead so she decided to speed up",
    "on a scale of one to ten, what's your favorite flavor of color?",
    "flying stinging insects rebelled in opposition to the matriarch"
]

In [5]:
embeddings = model.encode(sentences)
embeddings.shape

(7, 768)

In [6]:
embeddings

array([[ 0.07613575,  0.03554152, -0.04853423, ...,  0.02156135,
        -0.02294155, -0.01750991],
       [ 0.02348459,  0.03777696, -0.0244106 , ..., -0.00101797,
         0.01494204, -0.00690051],
       [-0.01087208, -0.06204989, -0.02355074, ...,  0.04565978,
         0.00899556, -0.04353992],
       ...,
       [-0.00952114, -0.00817684, -0.0054946 , ..., -0.01066917,
         0.00550096, -0.01924828],
       [-0.00313652,  0.03131971, -0.00896536, ...,  0.04947064,
        -0.04866311, -0.00352198],
       [ 0.0035543 , -0.04229828,  0.01761915, ...,  0.01523382,
         0.01262348, -0.01886442]], dtype=float32)

In [7]:
from sentence_transformers.util import cos_sim

In [8]:
sentences

['it caught him off guard that space smelled of seared steak',
 'she could not decide between painting her teeth or brushing her nails',
 "he thought there'd be sufficient time is he hid his watch",
 'the bees decided to have a mutiny against their queen',
 'the sign said there was road work ahead so she decided to speed up',
 "on a scale of one to ten, what's your favorite flavor of color?",
 'flying stinging insects rebelled in opposition to the matriarch']

In [9]:
sentences[-1]  # get the last element

'flying stinging insects rebelled in opposition to the matriarch'

In [10]:
sentences[:-1]  # get all except the last element

['it caught him off guard that space smelled of seared steak',
 'she could not decide between painting her teeth or brushing her nails',
 "he thought there'd be sufficient time is he hid his watch",
 'the bees decided to have a mutiny against their queen',
 'the sign said there was road work ahead so she decided to speed up',
 "on a scale of one to ten, what's your favorite flavor of color?"]

And what does our sentence transformer produce from these sentences? A 768-dimensional dense vector for each sentence. When these embeddings are compared using a similarity metric such as cosine similarity, the results are, in most cases, excellent.

In [11]:
# cosine similarity between the last sentence and every other sentence
scores = cos_sim(embeddings[-1], embeddings[:-1])

print(sentences[-1])
for i, score in enumerate(scores[0]):
  print(f"{round(score.item(), 4)} | {sentences[i]}")

flying stinging insects rebelled in opposition to the matriarch
0.1232 | it caught him off guard that space smelled of seared steak
0.1967 | she could not decide between painting her teeth or brushing her nails
0.0523 | he thought there'd be sufficient time is he hid his watch
0.6084 | the bees decided to have a mutiny against their queen
0.1011 | the sign said there was road work ahead so she decided to speed up
-0.0492 | on a scale of one to ten, what's your favorite flavor of color?


Although our two most semantically similar sentences, the ones about bees and their queen, share no descriptive words, the model places them closest together in vector space when measured with cosine similarity: 0.6084, against at most 0.1967 for any other sentence.
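To make the metric itself concrete, here is a minimal sketch of the cosine similarity formula in plain Python. The helper `cosine_similarity` and the toy vectors are assumptions made up for illustration; the notebook itself relies on `cos_sim` from sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # parallel to query, twice as long
orthogonal = [0.0, 0.0, 5.0]      # shares no direction with query

print(round(cosine_similarity(query, same_direction), 4))  # 1.0
print(round(cosine_similarity(query, orthogonal), 4))      # 0.0
```

Because the score depends only on direction, scaling a vector leaves it unchanged, which is why the parallel vector scores 1.0 despite being twice as long.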

##Question-Answering

First, we initialize a tokenizer and a model for both the context (ctx) encoder and the question encoder.

In [None]:
ctx_model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

In [None]:
question_model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

Given several questions and contexts, we tokenize and encode them like so:

In [16]:
questions = [
  "what is the capital city of australia?",
  "what is the best selling sci-fi book?",
  "how many searches are performed on Google?"
]
contexts = [
  "canberra is the capital city of australia",
  "what is the capital city of australia?",
  "the capital city of france is paris",
  "what is the best selling sci-fi book?",
  "sc-fi is a popular book genre read by millions",
  "the best-selling sci-fi book is dune",
  "how many searches are performed on Google?",
  "Google serves more than 2 trillion queries annually",
  "Google is a popular search engine"
]

In [17]:
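# tokenize the contexts, padding or truncating each to 256 tokens
xb_tokens = ctx_tokenizer(contexts, max_length=256, padding="max_length", truncation=True, return_tensors="pt")
# encode the contexts; no gradients are needed for inference
with torch.no_grad():
  xb = ctx_model(**xb_tokens)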

In [18]:
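# tokenize the questions the same way as the contexts
xq_tokens = question_tokenizer(questions, max_length=256, padding="max_length", truncation=True, return_tensors="pt")
# encode the questions; no gradients are needed for inference
with torch.no_grad():
  xq = question_model(**xq_tokens)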

In [19]:
xb.pooler_output.shape, xq.pooler_output.shape

(torch.Size([9, 768]), torch.Size([3, 768]))

Now we can compare each query embedding in xq against all of the context embeddings in xb, using cosine similarity to find the most similar context.

In [21]:
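for i, xq_vec in enumerate(xq.pooler_output):
  # cosine similarity between this question and every context
  scores = cos_sim(xq_vec, xb.pooler_output)
  # index of the highest-scoring context
  best = torch.argmax(scores).item()
  print(questions[i])
  print(contexts[best])
  print("---")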

what is the capital city of australia?
canberra is the capital city of australia
---
what is the best selling sci-fi book?
the best-selling sci-fi book is dune
---
how many searches are performed on Google?
how many searches are performed on Google?
---


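Out of our three questions, two returned the correct context as the top result. For the third, DPR returned the question itself rather than the context stating that Google serves more than 2 trillion queries annually. It's clear that DPR is not a perfect model, particularly given how simple our questions are and how small the dataset it retrieves from is.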
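The loop above keeps only the single best match, which hides how close the runner-up contexts came. As a rough sketch, the ranking can be extended to return the top few contexts per query. The helper `rank_contexts` is hypothetical (it is not part of transformers or sentence-transformers), and the 2-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_contexts(query_vec, context_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar contexts."""
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(context_vecs)
    ]
    # highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-dimensional vectors (assumed values, not DPR outputs).
context_vecs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
query_vec = [2.0, 1.0]

for idx, score in rank_contexts(query_vec, context_vecs, top_k=2):
    print(idx, round(score, 4))
# 2 0.9487
# 0 0.8944
```

Applied to the notebook's embeddings, a ranking like this would show whether the correct context for the third question sits just below the top result or much further down.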

##Vision Transformers