<a href="https://colab.research.google.com/github/sanketrk/Clustering-of-Network-Complaint-Data/blob/master/2_Hands_on_Build_an_improved_search_engine_with_Transformers_and_Rerankers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Improved Question-Answering Search Engines with Transformers and Rerankers


![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking already trained models and tuning or adapting them to our own downstream tasks.

# QA Search Engine using Transformers

## Retrieval and Re-ranking

In Semantic Search we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search.

For complex search tasks, such as question answering retrieval, search quality can be improved significantly by using Retrieve & Re-Rank.


## Retrieve & Re-Rank Pipeline

A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this notebook:

![](https://i.imgur.com/yIXJRSo.png)


Given a search query, we first use a retrieval system that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query.
For the retrieval, we can use either lexical search, e.g. with ElasticSearch, or dense retrieval with a bi-encoder. Simple lexical search can be based on TF-IDF, BM25, and similar term-weighting schemes.


However, the retrieval system might retrieve documents that are not that relevant for the search query.
Hence, in a second stage, we use a re-ranker based on a cross-encoder that scores the relevancy of all candidates for the given search query.

The output will be a ranked list of hits we can present to the user.
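
The two stages can be sketched in plain Python. This is only an illustration of the control flow, not the notebook's implementation: `cheap_score` and `expensive_score` are hypothetical stand-ins for the retriever and the cross-encoder, and the toy corpus is made up.

```python
def retrieve_and_rerank(query, corpus, cheap_score, expensive_score, top_k=3):
    """Two-stage search: a cheap scorer shortlists, an expensive scorer re-orders."""
    # Stage 1: score every document with the cheap scorer and keep the top_k candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k]
    # Stage 2: re-score only the shortlisted candidates with the expensive scorer.
    return sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)


def cheap_score(query, doc):
    # Hypothetical retriever: number of shared lowercase tokens.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def expensive_score(query, doc):
    # Hypothetical re-ranker: token overlap divided by the size of the token union.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


corpus = [
    "new delhi is the capital of india",
    "the capital of france is paris",
    "india is a country in south asia and the capital region is large",
    "cheetahs run fast",
]
ranked = retrieve_and_rerank("capital of india", corpus, cheap_score, expensive_score, top_k=3)
print(ranked[0])
```

Only the `top_k` shortlisted documents ever reach the expensive scorer, which is what keeps the second stage affordable.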


## Retrieval: Bi-Encoder

For the retrieval of the candidate set, we can either use lexical search (e.g. ElasticSearch), or we can use a bi-encoder (semantic search) which is implemented in this repository.

Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations.
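
To illustrate this limitation, here is a toy lexical matcher. It is a plain token-overlap count, far simpler than BM25 or what ElasticSearch does; the `lexical_score` helper and the two documents are made up for this sketch.

```python
def lexical_score(query, document):
    """Count how many distinct query tokens appear literally in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens)


docs = [
    "the cheetah is the fastest land animal",
    "an automobile is a wheeled motor vehicle",
]

# Literal match: "cheetah" occurs in the first document.
cheetah_scores = [lexical_score("cheetah speed", d) for d in docs]

# Synonym miss: "car" never occurs literally, so the automobile document scores 0.
car_scores = [lexical_score("car", d) for d in docs]

print(cheetah_scores, car_scores)
```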

In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.

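For a given sentence or document, a Bi-Encoder produces a single embedding. Queries and documents are encoded independently, so the document embeddings can be computed once in advance.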
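
Closeness in vector space is commonly measured with cosine similarity. The sketch below shows dense retrieval on hand-made 3-dimensional vectors that stand in for real embeddings; `cosine_similarity` and `dense_retrieve` are illustrative helpers, not library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_retrieve(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs for the top_k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up "embeddings"; a real bi-encoder outputs hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],  # document 0
    [0.0, 1.0, 0.0],  # document 1
    [0.7, 0.7, 0.0],  # document 2
]
query_vec = [1.0, 0.0, 0.0]

top = dense_retrieve(query_vec, doc_vecs, top_k=2)
print(top)
```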


## Re-Ranker: Cross-Encoder

The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.

A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a candidate document are passed simultaneously to a transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query.

![](https://i.imgur.com/PFgkrcI.png)

The advantage of Cross-Encoders is their higher ranking accuracy, as they perform attention across the query and the document.

Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.

First, you use an efficient Bi-Encoder to retrieve e.g. the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination.
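
A rough count of transformer forward passes at query time shows why the two stages are combined. This back-of-the-envelope sketch assumes the passage embeddings were computed once offline and ignores the cost of the vector comparison itself; the corpus size is an arbitrary example.

```python
def passes_cross_encoder_only(num_passages):
    # A cross-encoder must process every (query, passage) pair at query time.
    return num_passages


def passes_retrieve_and_rerank(top_k):
    # One bi-encoder pass to embed the query (passage embeddings are precomputed),
    # plus one cross-encoder pass per shortlisted candidate.
    return 1 + top_k


num_passages = 1_000_000  # arbitrary example corpus size
top_k = 100               # candidates handed to the re-ranker

full_scan = passes_cross_encoder_only(num_passages)
two_stage = passes_retrieve_and_rerank(top_k)
print(full_scan, two_stage)
```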





## Retrieve & Re-Rank Search Engine over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over Simple English Wikipedia.

You can input a query or a question. The script then uses semantic search to find relevant passages in Simple English Wikipedia.

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

In [None]:
!nvidia-smi

### Install Dependencies

In [None]:
!pip install -U sentence-transformers

### Load Transformer Models, Wikipedia Data and Generate Embeddings

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve the 30 most similar passages (`top_k=30` in the code below) as candidates that may answer the input query.

Next, we use a more powerful cross-encoder, `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`, that scores the query against each retrieved passage for relevance. Re-ranking with the cross-encoder further improves the quality of the results.

MS MARCO is a large-scale information retrieval corpus that was created from real user search queries issued to the Bing search engine.

The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

## Load Wikipedia Dataset

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch


# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

## Subset Dataset

In [None]:
# We keep only passages that mention at least one of these keywords, so things run faster.
# any() adds each matching passage once, even if it contains several keywords.
keywords = ['india', 'north pole', 'nlp',
            'natural language processing', 'linguistics',
            'machine learning', 'artificial intelligence',
            'cheetah', 'animal', 'jaguar']
passages = [passage for passage in passages
            if any(keyword in passage.lower() for keyword in keywords)]

## Look at sample documents

In [None]:
len(passages)

In [None]:
passages[0]

## Load Transformer Models

In [None]:
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the Bi-Encoder to encode all passages, so that we can use them for semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# The bi-encoder retrieves the candidate passages (top_k=30 in the cells below).
# We use a cross-encoder to re-rank the candidate list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

## Get Wikipedia Document Embeddings

In [None]:
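# We encode all passages into our vector space. This takes a few seconds (depending on your GPU speed).
# One way to complete this cell: encode the whole list and keep the result as a tensor.
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)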

In [None]:
passages[0]

In [None]:
corpus_embeddings[0], corpus_embeddings[0].shape

## Try Search with a Sample Query

### New Query

In [None]:
query = "What is the capital of India?"
query

### Get Embedding for New Query

In [None]:
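# One way to complete this cell: encode the query with the same bi-encoder.
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
query_embedding.shape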

### Get Cosine Similarity Score of Document Embeddings compared to New Query Embedding

In [None]:
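# One way to complete this cell: cos_sim returns a (1, num_passages) matrix, so take row 0.
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
cos_scores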

### Get Most Similar Document ID

In [None]:
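# One way to complete this cell: torch.topk returns the highest scores and their indices.
top_results = torch.topk(cos_scores, k=1)
idx = top_results.indices[0].item()
idx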

### Get Most Similar Document

In [None]:
passages[idx]

## Alternate way of getting most similar document

In [None]:
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
hits[0]

In [None]:
hits[0][0]['corpus_id']
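
`util.semantic_search` returns one list of hits per query; each hit is a dictionary with a `corpus_id` and a `score`, sorted by decreasing score. That is why the cells above index the result with `[0]` first. The pure-Python stand-in below mimics that shape on made-up similarity scores; `toy_semantic_search` is a hypothetical helper, not part of sentence-transformers.

```python
def toy_semantic_search(score_rows, top_k=1):
    """Mimic the result shape: one list per query, each holding top_k hit dicts."""
    results = []
    for scores in score_rows:  # one row of similarity scores per query
        ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
        results.append([{"corpus_id": i, "score": s} for i, s in ranked[:top_k]])
    return results


# Made-up similarity scores of a single query against four passages.
toy_hits = toy_semantic_search([[0.12, 0.87, 0.45, 0.30]], top_k=2)
best_id = toy_hits[0][0]["corpus_id"]
print(toy_hits, best_id)
```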

## Bi Encoder + ReRanker Cross Encoder Search

### Get top K Similar documents from Bi-encoder and format input data for Reranker Cross-encoder

In [None]:
# Get top 30 similar documents (hits) to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=30)
hits = hits[0]
# Format data for the reranker -> [query, similar_doc] for each of the top_k similar documents
reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
reranker_inp[:3] # look at the first 3 query inputs to the reranker cross encoder model

### Get Reranker score for every similar document

In [None]:
reranker_scores = cross_encoder.predict(reranker_inp)
reranker_scores[:3] # look at relevance scores from reranker cross encoder

### Add Reranker score back to the hits dictionary

In [None]:
hits[:3]

In [None]:
for i, hit in enumerate(hits):
    hit['reranker_score'] = reranker_scores[i]
hits[:3]

### Show the top similar document to query based on both models

In [None]:
print("Top Bi-Encoder Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['score'], reverse=True)[0]
print(passages[hit['corpus_id']])

print("Top Reranker Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[0]
print(passages[hit['corpus_id']])
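
The two sorts can disagree, which is the point of re-ranking. The hits below are made up (the scores are not model output) and only show the mechanics of picking a different top document under each key.

```python
# Made-up hits: a bi-encoder 'score' plus a 'reranker_score' for each candidate.
example_hits = [
    {"corpus_id": 7, "score": 0.71, "reranker_score": 0.42},
    {"corpus_id": 3, "score": 0.68, "reranker_score": 0.97},
    {"corpus_id": 5, "score": 0.55, "reranker_score": 0.03},
]

# Best candidate according to each model.
top_bi = max(example_hits, key=lambda h: h["score"])
top_rerank = max(example_hits, key=lambda h: h["reranker_score"])

# Full re-ranked order of corpus ids, best first.
reranked_ids = [
    h["corpus_id"]
    for h in sorted(example_hits, key=lambda h: h["reranker_score"], reverse=True)
]
print(top_bi["corpus_id"], top_rerank["corpus_id"], reranked_ids)
```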

## Create a function to return the top similar document based on any query

In [None]:
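def search(query, top_k=30):
  # print the input question
  print("Input question:", query)

  ##### Bi-Encoder: Semantic Search #####
  # Encode the query using the bi-encoder and find potentially relevant passages
  query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
  hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

  ##### Cross-Encoder: Re-Ranking #####
  # Now, score all retrieved passages with the reranker cross encoder
  reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
  reranker_scores = cross_encoder.predict(reranker_inp)

  # Store reranker cross encoder scores back into the hits variable
  for i, hit in enumerate(hits):
    hit['reranker_score'] = reranker_scores[i]

  # Output of top-1 hit from bi-encoder
  print("Top Bi-Encoder Retrieval hit: ")
  hit = sorted(hits, key=lambda x: x['score'], reverse=True)[0]
  print(passages[hit['corpus_id']])

  # Output of top-1 hit from re-ranker
  print("Top Reranker Retrieval hit: ")
  hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[0]
  print(passages[hit['corpus_id']])
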

## Try out the function

In [None]:
search(query = "What is the capital of India?")

In [None]:
search(query = "What is natural language processing?")

In [None]:
search(query = "What is language?")

In [None]:
search(query = "What is coldest place on earth?")

In [None]:
search(query = "What is the animal which can run very fast?")