# Building a Semantic Search Engine to Search for Queries with Transformers

# Semantic Search
Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms.


## Background
The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.

![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) 
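
The idea can be illustrated without any model at all. The following is a minimal sketch, assuming hand-made 3-dimensional vectors in place of real embeddings; `toy_corpus`, `toy_query_vector`, and `closest_entry` are hypothetical names used only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; a real model would produce these.
toy_corpus = {
    "How do I invest in shares?": [0.9, 0.1, 0.0],
    "Which fish survive in salt water?": [0.0, 0.2, 0.9],
    "How can I speed up my internet?": [0.1, 0.9, 0.1],
}
# Stands in for the embedding of a query about investing.
toy_query_vector = [0.8, 0.2, 0.1]

def closest_entry(query_vector, corpus_vectors):
    """Return the (text, score) pair with the highest cosine similarity."""
    scored = (
        (text, cosine_similarity(query_vector, vec))
        for text, vec in corpus_vectors.items()
    )
    return max(scored, key=lambda pair: pair[1])

best_text, best_score = closest_entry(toy_query_vector, toy_corpus)
print(best_text, round(best_score, 4))
```

The query vector points in nearly the same direction as the investing entry, so that entry is returned even though no words are compared.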


## Similarity Computation

For small corpora (up to about 100k entries) we can compute the cosine-similarity between the query and all entries in the corpus.

To do so, we compute the embeddings for the corpus as well as for our query.

We then use the [util.pytorch_cos_sim()](../../../docs/usage/semantic_textual_similarity.md) function to compute the cosine similarity between the query and all corpus entries.
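
For reference, the standard definition of cosine similarity is written out here. The symbols are notation introduced for this note: $q$ is the query embedding, $d_i$ is the embedding of the $i$-th corpus entry, and $n$ is the corpus size.

$$
\operatorname{cos\_sim}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}, \qquad i = 1, \dots, n
$$

The score lies between $-1$ and $1$, and a higher value means the two vectors point in a more similar direction.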

For large corpora, sorting all scores would take too much time. Hence, we can use [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html) to only get the top k entries.
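
To show what "top k without a full sort" means, here is a small standard-library analogue. It is a sketch only: `heapq.nlargest` stands in for `torch.topk`, and `top_k` and `toy_scores` are hypothetical names with made-up values.

```python
import heapq

def top_k(scores, k):
    """Return (values, indices) of the k largest scores, best first."""
    # nlargest tracks only k candidates at a time instead of sorting everything.
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [i for i, _ in best]
    values = [s for _, s in best]
    return values, indices

toy_scores = [0.12, 0.83, 0.47, 0.91, 0.05, 0.66]  # made-up similarity scores
values, indices = top_k(toy_scores, k=3)
print(values, indices)  # [0.91, 0.83, 0.66] [3, 1, 5]
```

Like `torch.topk`, the helper returns the values and their positions, so the positions can be used to look up the matching corpus entries.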

[Reference](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search)


## Objective

For today's objective we will create a corpus of around 50,000 question titles asked on Quora from an open dataset. Your task is to compute sentence embeddings and then retrieve the top 5 most similar questions from the corpus for the example queries given below.

Use [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), which provides a scalable way to generate document embeddings using transformers.



## Load Dependencies

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [2]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for senten

In [3]:
import transformers

In [4]:
import pandas as pd
import numpy as np

## Download and Load Corpus of Questions

In [5]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

--2021-12-13 19:23:16--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.1.2, 151.101.65.2, 151.101.129.2, ...
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.1.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2021-12-13 19:23:18 (186 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]



In [6]:
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t').head(25000)
df.head()

|  | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [7]:
corpus = df['question1'].tolist() + df['question2'].tolist()

In [8]:
len(corpus)

50000
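
Because `corpus` concatenates the `question1` and `question2` columns, the same question text can appear more than once; the smartphone query later in this notebook returns one sentence twice. The following is a minimal sketch of an optional cleanup step, shown on a toy list. It is not part of the original run, and applying it to `corpus` would shrink it below 50000 entries and change the embedding shape reported later. `deduplicate` and `toy_questions` are hypothetical names.

```python
def deduplicate(questions):
    """Drop repeated entries while keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so fromkeys acts as an ordered set.
    return list(dict.fromkeys(questions))

toy_questions = [
    "What are the best smartphones?",
    "Which fish would survive in salt water?",
    "What are the best smartphones?",
]
unique_questions = deduplicate(toy_questions)
print(len(toy_questions), "->", len(unique_questions))  # 3 -> 2
```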

## Use Sentence Transformers and Generate Corpus Embeddings

__Hint:__ You can use this tutorial as a reference

[Semantic Search Tutorial](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py)


# __Question 1__: Load Pre-trained Embedder Model

Load the __`roberta-large-nli-stsb-mean-tokens`__ model to generate embeddings

In [9]:
from sentence_transformers import SentenceTransformer

In [10]:
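# The exercise names roberta-large-nli-stsb-mean-tokens, but this run loads the
# smaller all-MiniLM-L6-v2 checkpoint, which yields the 384-dimensional embeddings seen below.
model = 'roberta-large-nli-stsb-mean-tokens'
embedder = SentenceTransformer('all-MiniLM-L6-v2')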

# __Question 2__: Generate Corpus Embeddings

Generate embeddings for every entry in the corpus using the pre-trained model.

In [11]:
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

In [12]:
corpus_embeddings.shape

torch.Size([50000, 384])
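
The shape also gives a rough idea of how much memory the corpus embeddings occupy. The arithmetic below is a sketch that assumes each value is stored as a 4-byte float32, which is an assumption and not something the output above states.

```python
num_entries, dim = 50000, 384  # taken from corpus_embeddings.shape
bytes_per_value = 4            # assumption: float32 storage

total_bytes = num_entries * dim * bytes_per_value
print(f"{total_bytes / 1e6:.1f} MB")  # 76.8 MB
```

At this size, comparing a query against every entry is cheap, which is why an exhaustive cosine-similarity search is practical for this corpus.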

# __Question 3__: Create a function to print top K similar sentences for a given query

Use cosine similarity by leveraging the pytorch utility in `sentence_transformers` as depicted in the previously linked tutorial.

In [13]:
from sentence_transformers import util
import torch

def print_similar_sentences(query, model_embedder, corpus_embeddings, top_k):
    """
      query: this should be your input query
      model_embedder: this should be your embedding model (pre-trained model which you loaded earlier)
      corpus_embeddings: this should hold the embeddings you generate for your corpus
      top_k: the top k similar queries you should return
    """

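    # Embed the query into the same vector space as the corpus.
    # Note: this call uses the global `embedder`; the `model_embedder` argument is not used here.
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # Cosine similarity between the query and every corpus entry, then keep the k best.
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(top_k))

    # top_results holds (values, indices); `corpus` is the global list built earlier.
    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))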

# __Question 4__: Perform Semantic Search on Sample Questions to get Similar Queries from the Corpus

In [14]:
s = 'What is the step by step guide to invest'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: What is the step by step guide to invest

Top 5 most similar sentences in corpus:
What is the step by step guide to invest in share market? (Score: 0.8322)
What is the step by step guide to invest in share market in india? (Score: 0.7443)
How should I start investment and buying shares? (Score: 0.6876)
What are the ways to get an investment for startup? (Score: 0.6730)
How do you join and start investing into the stock market? (Score: 0.6629)


In [15]:
s = 'What is Data Science?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: What is Data Science?

Top 5 most similar sentences in corpus:
What is actually a data science? (Score: 0.9618)
What is data science (Score: 0.9559)
What is big data science? (Score: 0.8296)
What does a data scientist do? (Score: 0.7899)
What is the difference between data science and data analysis? (Score: 0.7798)


In [16]:
s = 'What is natural language processing?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: What is natural language processing?

Top 5 most similar sentences in corpus:
How does natural language processing work? (Score: 0.8911)
Natural Language Processing: How widely used are context-free grammars in abstractive summarizations? (Score: 0.5558)
Which are the best schools for studying natural language processing? (Score: 0.5276)
What's formal language? (Score: 0.4948)
What is dialogue writing? (Score: 0.4914)


In [18]:
s = 'Best Harry Potter Movie?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: Best Harry Potter Movie?

Top 5 most similar sentences in corpus:
Which is the best Harry Potter movie? (Score: 0.9695)
Which Harry Potter movie is the best? (Score: 0.9671)
Which is your favourite Harry Potter movie and why? (Score: 0.8852)
What is the best Harry Potter movie and why? Is it also your favorite? Why or why not? (Score: 0.8773)
What do you think about Harry Potter Films? (Score: 0.7294)


In [19]:
s = 'What is the best smartphone?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: What is the best smartphone?

Top 5 most similar sentences in corpus:
What are the best smartphones? (Score: 0.9705)
What are the best smartphones? (Score: 0.9705)
Which is the best smartphone to buy now? (Score: 0.9303)
What is the best smartphone to date? (Score: 0.9029)
What are the best available smartphones gadgets? (Score: 0.8249)


In [20]:
s = 'What is the best starter pokemon?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: What is the best starter pokemon?

Top 5 most similar sentences in corpus:
How do you choose the right starter pokemon in any game? (Score: 0.8757)
Which set of starter Pokemon would you choose considering all generations and why? (Score: 0.8113)
Which is your favourite Pokémon and why? (Score: 0.7180)
What is your favorite Pokémon (from any generation or game), and why? (Score: 0.7137)
What is the best team in Pokemon Red? (Score: 0.7123)


In [21]:
s = 'Batman or Superman?'
print_similar_sentences(query=s,
                        model_embedder=embedder,
                        corpus_embeddings=corpus_embeddings,
                        top_k=5)





Query: Batman or Superman?

Top 5 most similar sentences in corpus:
Who would win Batman vs Batman? (Score: 0.8035)
Who is Batman's greatest foe? (Score: 0.7379)
Who would win in a fight between Iron Man and Batman? Why? (Score: 0.7228)
Superheroes: Who would win in a fight between Batman and the Flash? (Score: 0.7081)
Who would win in a fight between Captain America and Batman? (Score: 0.7061)
