<a href="https://colab.research.google.com/github/sanketrk/Clustering-of-Network-Complaint-Data/blob/master/2_Hands_on_Build_an_improved_search_engine_with_Transformers_and_Rerankers_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Improved Question-Answering Search Engines with Transformers and Rerankers


![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking already-trained models and tuning or adapting them to our own downstream tasks.

# QA Search Engine using Transformers

## Retrieval and Re-ranking

In Semantic Search we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search.

For complex search tasks, such as question answering retrieval, search quality can be improved significantly by using Retrieve & Re-Rank.


## Retrieve & Re-Rank Pipeline

A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this notebook:

![](https://i.imgur.com/yIXJRSo.png)


Given a search query, we first use a retrieval system that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query.
For the retrieval, we can use either lexical search, e.g. with ElasticSearch, or we can use dense retrieval with a bi-encoder. Simple Lexical searches can be based on TF-IDF, BM25 etc.
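
To make the lexical option concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not the ElasticSearch implementation; the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no literal match, so no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection.
toy_docs = [
    "new delhi is the capital of india",
    "mumbai is the largest city in india",
    "the cheetah is the fastest land animal",
]
toy_scores = bm25_scores("capital of india", toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(toy_docs[best])
```

A document that shares no word with the query scores exactly zero here, which is the limitation that dense retrieval tries to address.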


However, the retrieval system might retrieve documents that are not that relevant for the search query.
Hence, in a second stage, we use a re-ranker based on a cross-encoder that scores the relevancy of all candidates for the given search query.

The output will be a ranked list of hits we can present to the user.


## Retrieval: Bi-Encoder

For the retrieval of the candidate set, we can either use lexical search (e.g. ElasticSearch), or we can use a bi-encoder (semantic search) which is implemented in this repository.

Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations.

In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.

A Bi-Encoder produces an embedding for a given sentence or document.
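
As a rough illustration of "close in vector space", the sketch below ranks two documents by cosine similarity to a query. The 3-dimensional vectors and document names are hypothetical stand-ins; a real Bi-Encoder produces embeddings with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
toy_doc_embeddings = {
    "doc_about_capitals": [0.9, 0.1, 0.0],
    "doc_about_animals": [0.0, 0.2, 0.9],
}
toy_query_embedding = [0.8, 0.3, 0.1]

# Dense retrieval picks the document whose embedding is closest to the query.
nearest = max(
    toy_doc_embeddings,
    key=lambda name: cosine_similarity(toy_query_embedding, toy_doc_embeddings[name]),
)
print(nearest)
```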


## Re-Ranker: Cross-Encoder

The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.

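A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a candidate document are passed simultaneously to the transformer network, which then outputs a single score indicating how relevant the document is for the given query. Some Cross-Encoders squash this score into the range 0 to 1; others, like the one used later in this notebook, return unbounded scores where higher means more relevant.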

![](https://i.imgur.com/PFgkrcI.png)
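
When a Cross-Encoder returns an unbounded raw score, one common way to map it into the range 0 to 1 is the logistic sigmoid. The sketch below assumes the raw scores behave like logits, which depends on the model; the three input values are hypothetical.

```python
import math


def sigmoid(raw_score):
    """Map an unbounded raw score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical raw relevance scores for three (query, document) pairs.
raw_scores = [4.0, 0.0, -4.0]
probabilities = [sigmoid(s) for s in raw_scores]
print([round(p, 3) for p in probabilities])  # [0.982, 0.5, 0.018]
```

Because the sigmoid is monotonic, applying it changes the scale of the scores but not the order of the ranking.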

The advantage of Cross-Encoders is their higher accuracy, as they perform attention across the query and the document.

Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.

First, you use an efficient Bi-Encoder to retrieve e.g. the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination.
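
The two-stage idea can be sketched independently of any model. In the snippet below, `retrieve_and_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical names: the two scorers are simple stand-ins for the Bi-Encoder and the Cross-Encoder, used only so that the sketch runs on its own.

```python
def retrieve_and_rerank(query, documents, cheap_score, expensive_score, top_k=100):
    """Two-stage search: cheap scorer over everything, expensive scorer over top_k."""
    # Stage 1: rank the whole collection with the cheap scorer, keep top_k candidates.
    candidates = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    # Stage 2: re-rank only those candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)


# Stand-in for the retriever: number of query words found in the document.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Stand-in for the re-ranker: overlap relative to document length.
def overlap_ratio(query, doc):
    return word_overlap(query, doc) / len(doc.lower().split())


docs = [
    "mumbai is the financial capital of india and its largest city",
    "new delhi is the capital of india",
    "the cheetah is the fastest land animal",
]
ranked = retrieve_and_rerank("capital of india", docs, word_overlap, overlap_ratio, top_k=2)
print(ranked[0])
```

The expensive scorer is called only on the `top_k` candidates that survive the first stage, which is what keeps the overall cost manageable for large collections.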





## Retrieve & Re-Rank Search Engine over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over Simple English Wikipedia.

You can input a query or a question. The script then uses semantic search to find relevant passages in Simple English Wikipedia.

___[Created By: Dipanjan (DJ)](https://www.linkedin.com/in/dipanjans/)___

In [None]:
!nvidia-smi

Wed Mar 13 21:33:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Install Dependencies

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)

### Load Transformer Models, Wikipedia Data and Generate Embeddings

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve 30 potentially relevant passages that answer the input query.

Next, we use a more powerful Cross-Encoder, `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`, that scores the query against each retrieved passage for relevance. The Cross-Encoder further improves the quality of the results.

MS MARCO is a large-scale information retrieval corpus that was created from real user search queries on the Bing search engine.

The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

## Load Wikipedia Dataset

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch


# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597
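
The loading loop above expects one JSON object per line, each with a `paragraphs` list. The sketch below wraps the same idea in a function and tries it on a tiny temporary file. The function name `load_first_paragraphs`, the `title` field, and the toy records are hypothetical; the guard for empty `paragraphs` lists is an extra precaution, not something the original loop does.

```python
import gzip
import json
import os
import tempfile


def load_first_paragraphs(path):
    """Read a gzipped JSON Lines file and return each record's first paragraph."""
    first_paragraphs = []
    with gzip.open(path, "rt", encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            paragraphs = record.get("paragraphs", [])
            if paragraphs:  # skip records without any paragraph
                first_paragraphs.append(paragraphs[0])
    return first_paragraphs


# Build a tiny hypothetical file in the same one-object-per-line layout.
toy_records = [
    {"title": "A", "paragraphs": ["First paragraph of A.", "Second paragraph of A."]},
    {"title": "B", "paragraphs": []},
    {"title": "C", "paragraphs": ["Only paragraph of C."]},
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.jsonl.gz")
with gzip.open(toy_path, "wt", encoding="utf8") as f_out:
    for rec in toy_records:
        f_out.write(json.dumps(rec) + "\n")

toy_passages = load_first_paragraphs(toy_path)
print(toy_passages)
```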


## Subset Dataset

In [None]:
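# We keep only passages that mention one of a few keywords, so that everything runs faster.
# Note: a passage that matches several keywords is added once per matching keyword,
# so the subset can contain duplicate passages.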

passages = [passage for passage in passages for x in ['india', 'north pole', 'nlp',
                                                      'natural language processing', 'linguistics',
                                                      'machine learning', 'artificial intelligence',
                                                      'cheetah', 'animal', 'jaguar']
                                                    if x in passage.lower()]
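
The nested comprehension above yields a passage once for every keyword it contains. If duplicates are not wanted, one possible alternative is to test each passage with `any(...)`, as in the sketch below. The helper name `filter_by_keywords` and the toy texts are hypothetical, and switching to this variant would change the passage count and the document indices shown in the outputs of this notebook.

```python
def filter_by_keywords(texts, keywords):
    """Keep each text at most once if it mentions any keyword (case-insensitive)."""
    lowered_keywords = [k.lower() for k in keywords]
    return [t for t in texts if any(k in t.lower() for k in lowered_keywords)]


toy_texts = [
    "The cheetah is an animal found in Africa.",              # matches two keywords
    "Paris is the capital of France.",                        # matches none
    "Machine learning is part of artificial intelligence.",   # matches two keywords
]
toy_keywords = ["cheetah", "animal", "machine learning", "artificial intelligence"]
toy_subset = filter_by_keywords(toy_texts, toy_keywords)
print(len(toy_subset))  # 2
```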

## Look at sample documents

In [None]:
len(passages)

5917

In [None]:
passages[0]

"The integumentary system is everything covering the outside of an animal's body. This account is written mostly with people in mind, but it applies more widely."

## Load Transformer Models

In [None]:
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# The bi-encoder retrieves the top-k candidate passages (top_k=30 in the search code below).
# We use a cross-encoder to re-rank the results list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

## Get Wikipedia Document Embeddings

In [None]:
# We encode all passages into our vector space. This takes a few seconds (depending on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, show_progress_bar=True)

Batches:   0%|          | 0/185 [00:00<?, ?it/s]
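
The batch count in the progress bar can be checked with a little arithmetic. The sketch below assumes that `encode()` was run with a batch size of 32, which is the library default as far as we know; the passage count is the `len(passages)` value reported earlier.

```python
import math

num_passages = 5917   # len(passages) reported above
batch_size = 32       # assumption: default batch size of encode()

# The last batch is only partly filled, hence the ceiling.
num_batches = math.ceil(num_passages / batch_size)
print(num_batches)  # 185
```

A larger batch size would mean fewer batches, at the cost of more GPU memory per batch.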

In [None]:
passages[0]

"The integumentary system is everything covering the outside of an animal's body. This account is written mostly with people in mind, but it applies more widely."

In [None]:
corpus_embeddings[0], corpus_embeddings[0].shape

(array([ 2.43610446e-03,  3.29437926e-02,  4.70606014e-02, -7.68167479e-03,
         1.48560539e-01, -7.82975107e-02,  2.99415477e-02,  1.36641292e-02,
         1.53808743e-02,  1.40475079e-01,  2.35609505e-02, -8.67463648e-02,
         2.46762913e-02,  8.08474980e-03, -1.67811643e-02, -6.50802329e-02,
        -3.48654017e-02, -7.70691840e-04, -5.02524003e-02,  7.39627331e-03,
        -2.48694886e-02,  6.15889803e-02, -2.88300738e-02,  8.74973182e-03,
        -1.11718066e-01, -3.45828454e-03, -4.96662334e-02, -5.51604889e-02,
        -3.83206941e-02, -6.52881041e-02, -4.57847957e-03, -4.55709547e-02,
         5.72044179e-02,  3.42551805e-02, -4.65118140e-03, -2.70470697e-02,
         3.08932122e-02, -2.60512847e-02, -5.11268862e-02,  3.61177027e-02,
         7.31430808e-03,  2.07178947e-02,  2.00569984e-02,  2.59086005e-02,
         8.13343152e-02,  4.40371670e-02, -1.29732475e-01, -6.40622452e-02,
         8.23869649e-03, -3.58666468e-04,  7.39453956e-02, -4.48799320e-02,
        -4.2

## Try Search with a Sample Query

### New Query

In [None]:
query = "What is the capital of India?"
query

'What is the capital of India?'

### Get Embedding for New Query

In [None]:
query_embedding = bi_encoder.encode(query)
query_embedding.shape

(384,)

### Get Cosine Similarity Score of Document Embeddings compared to New Query Embedding

In [None]:
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
cos_scores

tensor([ 0.0209, -0.0524,  0.2248,  ...,  0.1950, -0.1026,  0.3683])

### Get Most Similar Document ID

In [None]:
top_results = torch.topk(cos_scores, k=1)
idx = top_results.indices.item()
idx

94
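
For intuition, the top-k selection can be sketched without tensors, using only the standard library. The helper `top_k_indices` and the score values are hypothetical; this is not how `torch.topk` is implemented, only an equivalent result on a plain list.

```python
import heapq


def top_k_indices(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical similarity scores for five documents.
sim_scores = [0.02, -0.05, 0.22, 0.60, 0.37]
print(top_k_indices(sim_scores, k=2))  # [(3, 0.6), (4, 0.37)]
```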

### Get Most Similar Document

In [None]:
passages[idx]

"Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."

## Alternate way of getting most similar document

In [None]:
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
hits[0]

[{'corpus_id': 94, 'score': 0.5979240536689758}]

In [None]:
hits[0][0]['corpus_id']

94

## Bi Encoder + ReRanker Cross Encoder Search

### Get top K Similar documents from Bi-encoder and format input data for Reranker Cross-encoder

In [None]:
# Get top 30 similar documents (hits) to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=30)
hits = hits[0]
# Format data for the reranker -> [query, similar_doc] for each of the top_k similar documents
reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
reranker_inp[:3] # look at the first 3 query inputs to the reranker cross encoder model

[['What is the capital of India?',
  "Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."],
 ['What is the capital of India?',
  "Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capita

### Get Reranker score for every similar document

In [None]:
reranker_scores = cross_encoder.predict(reranker_inp)
reranker_scores[:3] # look at relevance scores from reranker cross encoder

array([3.8610315, 3.4595225, 2.7084029], dtype=float32)

### Add Reranker score back to the hits dictionary

In [None]:
hits[:3]

[{'corpus_id': 94, 'score': 0.5979240536689758},
 {'corpus_id': 789, 'score': 0.5937108397483826},
 {'corpus_id': 4586, 'score': 0.5878058671951294}]

In [None]:
for i, hit in enumerate(hits):
    hit['reranker_score'] = reranker_scores[i]
hits[:3]

[{'corpus_id': 94, 'score': 0.5979240536689758, 'reranker_score': 3.8610315},
 {'corpus_id': 789, 'score': 0.5937108397483826, 'reranker_score': 3.4595225},
 {'corpus_id': 4586, 'score': 0.5878058671951294, 'reranker_score': 2.7084029}]

### Show the top similar document to query based on both models

In [None]:
print("Top Bi-Encoder Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['score'], reverse=True)[0]
print(passages[hit['corpus_id']])

print("Top Reranker Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[0]
print(passages[hit['corpus_id']])

Top Bi-Encoder Retrieval hit: 
Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.
Top Reranker Retrieval hit: 
New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million people.


## Create a function to return the top similar document based on any query

In [None]:
def search(query, top_k=30):
  # print the input question
  print("Input question:", query)

  ##### Bi-Encoder: Semantic Search #####
  # Encode the query using the bi-encoder and find potentially relevant passages
  question_embedding = bi_encoder.encode(query)
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
  hits = hits[0]

  ##### Cross-Encoder: Re-Ranking #####
  # Now, score all retrieved passages with the reranker cross encoder
  reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
  reranker_scores = cross_encoder.predict(reranker_inp)

  # Store reranker cross encoder scores back into the hits variable
  for i, hit in enumerate(hits):
    hit['reranker_score'] = reranker_scores[i]

  # Output of top-1 hit from bi-encoder
  print("\n-------------------------\n")
  print("Top Bi-Encoder Retrieval hit")
  hit = sorted(hits, key=lambda x: x['score'], reverse=True)[0]
  print(passages[hit['corpus_id']])

  # Output of top-1 hit from re-ranker
  print("\n-------------------------\n")
  print("Top Cross-Encoder Re-ranker hit")
  hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[0]
  print(passages[hit['corpus_id']])
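
The function above prints its results. For programmatic use, the score-attaching and sorting step could be factored into a small helper that returns the re-ranked hits instead. The sketch below is one possible shape for that; the name `attach_and_sort` and the toy hits and scores are hypothetical.

```python
def attach_and_sort(hits, reranker_scores):
    """Attach one re-ranker score per hit and return the hits sorted best first."""
    if len(hits) != len(reranker_scores):
        raise ValueError("need exactly one re-ranker score per hit")
    # Copy each dict so the caller's hits are left untouched.
    scored = [dict(hit, reranker_score=float(score))
              for hit, score in zip(hits, reranker_scores)]
    return sorted(scored, key=lambda h: h["reranker_score"], reverse=True)


# Hypothetical hits in the same shape as above, with made-up scores.
toy_hits = [
    {"corpus_id": 10, "score": 0.60},
    {"corpus_id": 20, "score": 0.59},
    {"corpus_id": 30, "score": 0.58},
]
toy_reranked = attach_and_sort(toy_hits, [1.5, 4.2, -0.3])
print([h["corpus_id"] for h in toy_reranked])  # [20, 10, 30]
```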

## Try out the function

In [None]:
search(query = "What is the capital of India?")

Input question: What is the capital of India?

-------------------------

Top Bi-Encoder Retrieval hit
Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.

-------------------------

Top Cross-Encoder Re-ranker hit
New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million p

In [None]:
search(query = "What is natural language processing?")

Input question: What is natural language processing?

-------------------------

Top Bi-Encoder Retrieval hit
Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.

-------------------------

Top Cross-Encoder Re-ranker hit
Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.


In [None]:
search(query = "What is language?")

Input question: What is language?

-------------------------

Top Bi-Encoder Retrieval hit
Philosophy of language is the study of how languages were created and are used. It is part of Linguistics. In continental philosophy, it is not treated as a subject by itself, but Ludwig Wittgenstein and other analytic philosophers placed particular stress on it.

-------------------------

Top Cross-Encoder Re-ranker hit
Language is the normal way humans communicate. Only humans use language, though other animals communicate through other means.


In [None]:
search(query = "What is coldest place on earth?")

Input question: What is coldest place on earth?

-------------------------

Top Bi-Encoder Retrieval hit
East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.

-------------------------

Top Cross-Encoder Re-ranker hit
East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.


In [None]:
search(query = "What is the animal which can run very fast?")

Input question: What is the animal which can run very fast?

-------------------------

Top Bi-Encoder Retrieval hit
Running is the way in which people or animals travel quickly on their feet. It is a method of travelling on land. It is different to walking in that both feet are regularly off the ground at the same time. Different terms are used to refer to running according to the speed: jogging is slow, and sprinting is running fast.

-------------------------

Top Cross-Encoder Re-ranker hit
A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening.
