<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/sentence-transformer-works/04_quora_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Semantic Search using SBERT on Quora Questions dataset

This notebook contains an example of how to perform semantic search with PyTorch. It performs exact nearest neighbor search.

As the dataset, we use the Quora Duplicate Questions dataset, which contains about 500k questions (we use only about 100k):
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs


As the embedding model, we use the SBERT model 'quora-distilbert-multilingual',
which is aligned for 100 languages. That is, you can type a question in various languages and it will
return the closest questions in the corpus (the questions in the corpus are mainly in English).
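
Exact nearest neighbor search means the query embedding is scored against every corpus embedding, with no approximate index in between. The following is a minimal, dependency-free sketch of that idea, assuming cosine similarity as the score. The names `cosine_similarity` and `exact_search` and the toy vectors are hypothetical and exist only for illustration; the notebook itself relies on `util.semantic_search` on PyTorch tensors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_vec, corpus_vecs, top_k=5):
    """Score the query against every corpus vector and keep the top_k best."""
    scored = (
        {"corpus_id": idx, "score": cosine_similarity(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    )
    # nlargest returns the hits sorted by decreasing score
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy 2-dimensional "embeddings" (made up for illustration).
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = exact_search([1.0, 0.1], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])  # prints [0, 2]
```

Because every corpus vector is visited, the cost grows linearly with the corpus size, which is acceptable for the roughly 100k questions used here.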


##Setup

In [None]:
!pip install sentence-transformers

In [2]:
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU detected. Processing will be slow. Please add a GPU to this notebook")

In [None]:
model_name = 'quora-distilbert-multilingual'
model = SentenceTransformer(model_name)

##Build embeddings

In [None]:
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000


# Download the dataset if it is not already present locally
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

In [5]:
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        if len(corpus_sentences) >= max_corpus_size:
            break

        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)
print("Encoding the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)

###############################
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))

Encoding the corpus. This might take a while


Corpus loaded with 100000 sentences / embeddings
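
Because the cell above collects the questions in a `set`, the order of `corpus_sentences` can differ from run to run. If a stable order matters, one option is to deduplicate with a `dict`, which keeps first-seen order. This is a sketch under that assumption; `unique_questions` and the sample rows are hypothetical and not part of the original notebook.

```python
import csv
import io


def unique_questions(tsv_file, max_size):
    """Collect up to max_size distinct questions in first-seen order."""
    seen = {}  # dict keys are unique and keep insertion order
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        for column in ("question1", "question2"):
            seen.setdefault(row[column], None)
            if len(seen) >= max_size:
                return list(seen)
    return list(seen)


# Tiny in-memory stand-in for the TSV file (rows made up for illustration).
sample_tsv = io.StringIO(
    "question1\tquestion2\n"
    "How do I learn Python?\tWhat is Python?\n"
    "What is Python?\tHow do I learn Java?\n"
)
print(unique_questions(sample_tsv, max_size=3))
```

With the real file, the same function could be called as `unique_questions(fIn, max_corpus_size)` inside the `with open(...)` block.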


##Similarity Search

In [6]:
# Encode a question, run exact search over the corpus, and print the top 5 results
def search(inp_question):
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings)
    end_time = time.time()
    hits = hits[0]  # Only one query was passed, so take its list of hits

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))
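
`util.semantic_search` returns one list of hits per query. Each hit is a dict with a `corpus_id` and a `score`, ordered by decreasing score, which is why `search` reads `hits[0]` for its single query. The sketch below separates formatting from printing so the result lines can be reused or tested; `format_hits` and the hand-written hits are hypothetical and only mimic that shape.

```python
def format_hits(hits_for_query, corpus_sentences, limit=5):
    """Turn one query's hits into printable 'score<TAB>sentence' lines."""
    return [
        "{:.3f}\t{}".format(hit["score"], corpus_sentences[hit["corpus_id"]])
        for hit in hits_for_query[:limit]
    ]


# Hand-written hits in the shape described above (values made up).
example_sentences = ["first question", "second question", "third question"]
example_hits = [
    {"corpus_id": 2, "score": 0.91},
    {"corpus_id": 0, "score": 0.85},
]
for line in format_hits(example_hits, example_sentences):
    print(line)
```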

In [7]:
search("How can I learn Python online?")

Input question: How can I learn Python online?
Results (after 0.030 seconds):
	0.987	What is the best online resources to learn Python?
	0.980	Where I should learn Python?
	0.980	What's the best way to learn Python?
	0.980	How do I learn Python in easy way?
	0.980	What can I do if I learn Python?


In [8]:
#German: How can I learn Python online?
search("Wie kann ich Python online lernen?")

Input question: Wie kann ich Python online lernen?
Results (after 0.019 seconds):
	0.989	What is the best online resources to learn Python?
	0.980	Where I should learn Python?
	0.980	Where should I start at to learn about how to do Python?
	0.980	Which are the best resources to learn Python?
	0.979	What's the best way to learn Python?


In [9]:
#Chinese: How can I learn Python online?
search("如何在线学习Python")

Input question: 如何在线学习Python
Results (after 0.021 seconds):
	0.983	What is the best online resources to learn Python?
	0.982	Where can I learn web scraping using Python?
	0.982	Where should I start at to learn about how to do Python?
	0.981	What are some good resources to learn web scraping with Python?
	0.980	How do I learn Python systematically?
