<img src="https://miro.medium.com/max/3840/1*zIko_UJAnI5oOI7f9Dwacw.png">
<center><font color='brown' size=5>📖 Billion scale - similarity search</font></center>

# <font color='brown' size=4>Objective:</font> 
        
<p>Here we explore how semantic search is used for information retrieval, searching across millions of records in less than a second. The same idea of index-based retrieval can be extended to other structured and unstructured data, such as images and tabular data. Let's get started.</p>

In [1]:
!pip install sentence_transformers
!pip install faiss-gpu

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sentence_transformers import SentenceTransformer, util, CrossEncoder, InputExample, losses, models, datasets
import torch
import os
import csv
import pickle
import time
import faiss
import glob
from pprint import pprint
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm import tqdm
from torch import nn
import random
import gc

# Table of Contents

- 1. Semantic search
   - 1.1 Types of semantic search
   - 1.2 Models available in sbert

- 2. Sbert overview
   - 2.1 Architecture
   - 2.2 Choosing the right model
      
- 3. FAISS
   - 3.1 Different indexes

- 4. Implementation
   - 4.1 Bi-encoder with FAISS search
   - 4.2 Bi-encoder + Cross encoder with FAISS search
   - 4.3 Finetuning Bi-encoder unsupervised way
       - 4.3.1 Bi-encoder + Cross encoder with FAISS search (finetuned)
   - 4.4 Allen-ai SPECTER model
       
- 5. Acknowledgements

# <font color='brown' size=4>1. Semantic search</font>

<p>Semantic search is a data searching technique in which a search query aims not only to find keywords, but also to determine the intent and contextual meaning of the words a person is using for search.</p>

<p>Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.</p>
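
To make the limitation of lexical matching concrete, here is a minimal sketch using only the standard library. The query, the two documents and the helper names (`tokenize`, `lexical_overlap`) are made up for illustration; the point is only that a document built from synonyms shares no tokens with the query and so scores zero.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_overlap(query, document):
    """Jaccard overlap between the token sets of the query and the document."""
    q, d = tokenize(query), tokenize(document)
    return len(q & d) / len(q | d) if q | d else 0.0

query = "heart attack treatment"
docs = [
    "therapy for myocardial infarction",   # same meaning, no shared words
    "heart attack statistics",             # shared words, different intent
]
scores = [lexical_overlap(query, doc) for doc in docs]
print(scores)
```

A purely lexical scorer ranks the second document first even though the first one is the better answer, which is the gap that embedding-based search tries to close.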

<img src='https://www.ontotext.com/wp-content/uploads/2019/07/GraphDB-Semantic-Similarity-Plugin-for-Identifying-Related-Terms-Documents-1.png' width=1000>
<div align="center"><font size="3">Source: Google</font></div>

## <font color='brown' size=4>1.1 Types of semantic search</font>

**Types:**
* Symmetric search
* Asymmetric search

<p>The critical distinction between <b>symmetric and asymmetric semantic search</b> is as follows:</p>

<p>For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: your query could, for example, be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.</p>

<p>For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” where you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.</p>

<img src='https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png' width=1000>
<div align="center"><font size="3">Source: Sbert</font></div>

## <font color='brown' size=4>1.2 Models available in sbert</font>

There are multiple models available in SBERT, each fine-tuned on different datasets and benchmarked for comparison. Some of them are listed below.

<img src='https://miro.medium.com/max/700/1*-lVFYR44EnS1LpRtN6ZVvg.png' width=1000>
<div align="center"><font size="3">Source: Google</font></div>

# <font color='brown' size=4>2. Sbert overview</font>

<p>Sentence-BERT (SBERT) is a modification of the BERT network that uses siamese and triplet networks to derive semantically meaningful sentence embeddings.
    
For sentence / text embeddings, we want to map a variable-length input text to a fixed-size dense vector. The most basic network architecture we can use is the following:
</p>

<img src='https://www.sbert.net/_images/SBERT_Architecture.png' width=1000>
<div align="center"><font size="3">Source: Sbert</font></div>

<div class="alert simple-alert">
📌 <b>Note</b>: We feed the input sentence or text into a transformer network like BERT. BERT produces contextualized word embeddings for all input tokens in our text. As we want a fixed-size output representation (vector u), we need a pooling layer. Different pooling options are available; the most basic one is mean-pooling: we simply average all the contextualized word embeddings BERT gives us. This yields a fixed 768-dimensional output vector, independent of how long our input text was.
</div>
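
Mean-pooling is easy to sketch in NumPy. The snippet below is only an illustration: the random `tokens` array stands in for BERT's token embeddings, and the attention mask (an assumption about how padding is marked) keeps padded positions out of the average.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per text, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 6, 768)).astype("float32")   # stand-in for BERT outputs
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0, 0]])                      # second text has 3 real tokens
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)
```

Both texts end up as 768-dimensional vectors even though one has six tokens and the other only three.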

## <font color='brown' size=4>2.1 Architecture (STS benchmark)</font>

<p>The simplest way is to have sentence pairs annotated with a score indicating their similarity, e.g. on a scale of 0 to 1. We can then train the network with a siamese network architecture.</p>

<img src='https://www.sbert.net/_images/SBERT_Siamese_Network1.png' width=1000>
<div align="center"><font size="3">Source: Sbert</font></div>

<div class="alert simple-alert">
📌 <b>Note</b>: For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity, and the result is compared to the gold similarity score. This allows our network to be fine-tuned to recognize the similarity of sentences.
</div>
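
As a rough sketch of that training signal, the snippet below compares the cosine similarity of two toy embeddings with a gold score using a squared error. The vectors and the gold score are invented, and squared error is one common choice for this comparison rather than the only one.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_loss(u, v, gold_score):
    """Squared error between the predicted cosine similarity and the gold label."""
    return (cosine_similarity(u, v) - gold_score) ** 2

u = np.array([1.0, 0.0])            # toy embedding of sentence A
v = np.array([1.0, 1.0])            # toy embedding of sentence B
predicted = cosine_similarity(u, v)                  # cosine of a 45 degree angle
loss = cosine_similarity_loss(u, v, gold_score=0.9)
print(predicted, loss)
```

During fine-tuning, the gradient of this loss pushes the two embeddings closer together or further apart until their cosine similarity matches the annotation.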

## <font color='brown' size=4>2.2 Choosing the right model</font>

<p>It is critical that you choose the right model for your type of task. Models are mostly distinguished by the type of data they have been trained on. Also, models tuned for cosine similarity will prefer the retrieval of short documents, while models tuned for dot product will prefer the retrieval of longer documents. Depending on your task, models of one type or the other are preferable.</p>
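
The difference between the two scoring functions can be seen with toy vectors. In this sketch the numbers are invented, and a larger vector norm is used as a stand-in for the kind of embedding a dot-product model might produce for a longer document; that link is an assumption made for illustration only.

```python
import numpy as np

query = np.array([1.0, 1.0])
short_doc = np.array([1.0, 1.0])     # same direction as the query, small norm
long_doc = np.array([3.0, 2.0])      # slightly different direction, larger norm

def cosine(a, b):
    """Cosine similarity: the dot product after removing both vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_scores = {"short": float(query @ short_doc), "long": float(query @ long_doc)}
cos_scores = {"short": cosine(query, short_doc), "long": cosine(query, long_doc)}
print(dot_scores)
print(cos_scores)
```

Dot product rewards the larger vector, while cosine similarity only looks at direction, so the two metrics rank the same pair of documents differently.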

Suitable models for <b>symmetric semantic search</b>:

* paraphrase-distilroberta-base-v1 / paraphrase-xlm-r-multilingual-v1

* quora-distilbert-base / quora-distilbert-multilingual

* distiluse-base-multilingual-cased-v2

Suitable models for <b>asymmetric semantic search</b>:

* msmarco-distilbert-base-v2

# <font color='brown' size=4>3. FAISS</font>

<p>Faiss is a C++ library built by Facebook AI, with a complete Python wrapper, for indexing vectorized data and searching it efficiently. It addresses limitations of traditional query search engines that are optimised for hash-based searches, and provides more scalable similarity search functions. Importantly, it supports both <b>CPU and GPU</b> execution.</p>

FAISS has a handful of features including:

* GPU and multithreaded support for index operations
* Dimensionality reduction: vectors with large dimensions can be reduced to smaller dimensions using PCA
* Quantisation: FAISS emphasises product quantisation for compressing and storing vectors of large dimensions
* Batch processing, i.e. searching for multiple queries at a time

## <font color='brown' size=4>3.1 Different indexes</font>

Faiss offers different indexes that trade off the following factors:
* search time
* search quality
* memory used per index vector
* training time
* need for external data for unsupervised training
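
To get a feel for the "memory used per index vector" factor at billion scale, here is a back-of-the-envelope calculation. The choice of 64 sub-quantizers with 8-bit codes is an illustrative assumption, and the estimate ignores codebooks, ids and other index overhead.

```python
def flat_bytes_per_vector(dim, bytes_per_float=4):
    """A flat index stores every vector uncompressed as float32."""
    return dim * bytes_per_float

def pq_bytes_per_vector(n_subquantizers, bits_per_code=8):
    """Product quantisation stores one small code per sub-vector."""
    return n_subquantizers * bits_per_code // 8

n_vectors = 1_000_000_000
dim = 768

flat_gb = flat_bytes_per_vector(dim) * n_vectors / 1e9
pq_gb = pq_bytes_per_vector(64) * n_vectors / 1e9
print("flat: {:.0f} GB, PQ (64 x 8 bits): {:.0f} GB".format(flat_gb, pq_gb))
```

Under these assumptions the uncompressed vectors need about 3 TB, while the compressed codes fit in tens of gigabytes, at the price of approximate distances.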

Overall, our flow for semantic search will look like this:

<img src='https://miro.medium.com/max/4560/1*k7rgUFTqWdHyBY72PlCIow.png' width=1000>
<div align="center"><font size="3">Source: Medium</font></div>

# <font color='brown' size=4>4. Implementation</font>

We will use this [dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) for our exploration. This is a resource of over 500,000 scholarly articles, including over 200,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

In [1]:
embedder = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')

In [1]:
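# Read every cleaned CSV and concatenate them into a single DataFrame.
# (The loop variable is no longer called `csv`, which shadowed the imported csv module.)
csv_paths = glob.glob('../input/cord-19-eda-parse-json-and-generate-clean-csv/*.csv')
data = pd.concat([pd.read_csv(path) for path in csv_paths], axis=0).reset_index(drop=True)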

In [1]:
data=data.dropna(subset=['title','abstract']).reset_index(drop=True)
data['abstract']=data['abstract'].apply(lambda x: x.replace('\n',' ')[9:].strip())
data['text']=data['text'].apply(lambda x: x.replace('\n',' ')[12:].strip())

In [1]:
data.head()

In [1]:
data.shape

<div class="alert simple-alert">
📌 <b>Note</b>: Here we use a pretrained model tuned for dot product, since our data has fields such as title, abstract and text. We can use either keywords or a title as the query to find similar research papers, which is a short-query / long-document (asymmetric) retrieval setting.
</div>

# <font color='brown' size=4>4.1 Bi-encoder with FAISS search (inverted index search)</font>

Here we use an inverted file (IVF) index: the dataset is clustered into buckets and, at search time, only a fraction of the buckets is visited (nprobe buckets). The clustering is normally learned from a representative sample of the dataset vectors; in this notebook we train it on all the corpus embeddings.
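
The bucket idea can be sketched in plain NumPy. This is not how FAISS implements it: the data is random, the centroids are picked at random instead of being learned with k-means, and names like `ivf_search` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, n_clusters, nprobe = 2000, 16, 20, 3

vectors = rng.normal(size=(n, dim)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# "Training": choose centroids. Real IVF indexes run k-means; random picks keep the sketch short.
centroids = vectors[rng.choice(n, size=n_clusters, replace=False)]

# "Adding": put every vector into the bucket of its best-matching centroid.
assignments = np.argmax(vectors @ centroids.T, axis=1)
buckets = {c: np.flatnonzero(assignments == c) for c in range(n_clusters)}

def ivf_search(query, k):
    """Scan only the nprobe buckets whose centroids score highest for the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([buckets[int(c)] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]

query = vectors[0]
ids, scores = ivf_search(query, k=5)
print(ids, scores)
```

Only the vectors in 3 of the 20 buckets are scored, which is where the speed-up comes from. It is also why a true neighbour sitting in an unvisited bucket can be missed.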

In [1]:
class faiss_search:
    
    def __init__(self,max_corpus_size = 100000,embedding_size= 768,top_k_hits= 5):
        
        self.embedding_cache_path = 'abstract-embeddings-{}-size-{}.pkl'.format('msmarco-distilbert-base-dot-prod-v3', max_corpus_size)

        # Define the FAISS index.
        # Number of clusters (nlist). The FAISS guidelines suggest a value between 4*sqrt(N) and 16*sqrt(N):
        # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
        # Note: N is read from the global `data` DataFrame.
        n_clusters = int(round(4 * np.sqrt(data.shape[0])))

        # We use inner product (dot product) as the metric. Vectors are normalized to unit length
        # before indexing, so the inner product equals cosine similarity.
        quantizer = faiss.IndexFlatIP(embedding_size)
        self.index = faiss.IndexIVFFlat(quantizer, embedding_size, n_clusters, faiss.METRIC_INNER_PRODUCT)

        # Number of clusters to explore at search time: we search for nearest neighbours in 3 clusters.
        self.index.nprobe = 3
        
    
    def embed_corpus(self,data,embedder):
    
        #Check if embedding cache path exists
        if not os.path.exists(self.embedding_cache_path):

            corpus_sentences = data['abstract'].values.tolist()
            print("Encode the corpus. This might take a while")
            corpus_embeddings = embedder.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)

            print("Store file on disc")
            with open(self.embedding_cache_path, "wb") as fOut:
                pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
        else:
            print("Load pre-computed embeddings from disc")
            with open(self.embedding_cache_path, "rb") as fIn:
                cache_data = pickle.load(fIn)
                corpus_sentences = cache_data['sentences']
                corpus_embeddings = cache_data['embeddings']
                
        return corpus_sentences,corpus_embeddings
                
    def index_data(self,corpus_sentences,corpus_embeddings):
        ### Create the FAISS index
        print("Start creating FAISS index")
        # First, we need to normalize vectors to unit length
        corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1)[:, None]

        # Then we train the index to find a suitable clustering
        self.index.train(corpus_embeddings)

        # Finally we add all embeddings to the index
        self.index.add(corpus_embeddings)

        print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))


In [1]:
faiss_obj=faiss_search()
corpus_sentences,corpus_embeddings=faiss_obj.embed_corpus(data,embedder)
faiss_obj.index_data(corpus_sentences,corpus_embeddings)

The cell below compares the approximate results returned by the FAISS index with the exact results computed by exhaustive cosine similarity. Let's query the corpus for articles on COVID-19 death rates.

In [1]:
######### Search in the index ###########
top_k_hits=5
query='death rates of covid case'

start_time = time.time()
title_embedding = embedder.encode(query)

#FAISS works with inner product (dot product). When we normalize vectors to unit length, inner product is equal to cosine similarity
title_embedding = title_embedding / np.linalg.norm(title_embedding)
title_embedding = np.expand_dims(title_embedding, axis=0)

# Search in FAISS. It returns a matrix with distances and corpus ids.
distances, corpus_ids = faiss_obj.index.search(title_embedding, top_k_hits)

# We extract corpus ids and scores for the first query (FAISS pads missing results with id -1)
hits = [{'corpus_id': idx, 'score': score} for idx, score in zip(corpus_ids[0], distances[0]) if idx != -1]
hits = sorted(hits, key=lambda x: x['score'], reverse=True)
end_time = time.time()

print("Input title:", query)
print("\n")
print("Results (after {:.3f} seconds):".format(end_time-start_time))
print("\n")
for hit in hits[0:top_k_hits]:
    print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))

# Approximate Nearest Neighbor (ANN) is not exact, it might miss entries with high cosine similarity
# Here, we compute the recall of ANN compared to the exact results
correct_hits = util.semantic_search(title_embedding, corpus_embeddings, top_k=top_k_hits)[0]
correct_hits_ids = set([hit['corpus_id'] for hit in correct_hits])

ann_corpus_ids = set([hit['corpus_id'] for hit in hits])
if len(ann_corpus_ids) != len(correct_hits_ids):
    print("Approximate Nearest Neighbor returned a different number of results than expected")

recall = len(ann_corpus_ids.intersection(correct_hits_ids)) / len(correct_hits_ids)
print("\nApproximate Nearest Neighbor Recall@{}: {:.2f}".format(top_k_hits, recall * 100))

if recall < 1:
    print("Missing results:")
    for hit in correct_hits[0:top_k_hits]:
        if hit['corpus_id'] not in ann_corpus_ids:
            print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))
            
# Drop the references first, then let the garbage collector reclaim the memory
del faiss_obj,corpus_sentences,corpus_embeddings
gc.collect()

<div class="alert simple-alert">
📌 <b>Note</b>: We can see that Recall@5 is 60% for the given query when the indexed (approximate) search is compared with the exhaustive cosine-similarity search. The index missed a few important articles on death rates.
</div>
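
The recall figure is simply the share of the exact top-k results that the approximate search also returned. A minimal sketch, with hypothetical ids chosen so that 3 of the 5 exact hits are found:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)

exact_top5 = [12, 7, 31, 4, 55]      # hypothetical ids from the exhaustive search
approx_top5 = [12, 7, 31, 90, 18]    # hypothetical ids from the IVF index
print(recall_at_k(approx_top5, exact_top5))  # 0.6
```

Raising nprobe usually increases this number, since more buckets are scanned, at the cost of a slower search.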

<b>Bi-encoder with faiss search using flat indexing</b>

<p>The only index that can guarantee exact results is the IndexFlatL2 or IndexFlatIP. It provides the baseline for results for the other indexes. It does not compress the vectors, but does not add overhead on top of them. The flat index does not require training and does not have parameters.</p>
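
Conceptually, a flat inner-product index does the equivalent of one matrix multiplication followed by a top-k selection. The NumPy sketch below shows that idea on a four-vector toy corpus; the function name `flat_ip_search` is made up, and FAISS itself is far more optimised.

```python
import numpy as np

def flat_ip_search(corpus, queries, k):
    """Exact top-k inner-product search; returns (scores, ids) with one row per query."""
    scores = queries @ corpus.T                      # (n_queries, n_corpus)
    ids = np.argsort(-scores, axis=1)[:, :k]         # best match first
    return np.take_along_axis(scores, ids, axis=1), ids

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]], dtype="float32")
queries = np.array([[1.0, 0.0], [0.0, 2.0]], dtype="float32")

scores, ids = flat_ip_search(corpus, queries, k=2)
print(ids)
```

Every corpus vector is scored for every query, so nothing can be missed, but the cost grows linearly with the corpus size.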

In [1]:
class faiss_index:
    def __init__(self,data,model):
        self.data=data
        self.model=model
        self.faiss_index=None
    
    def index(self):
        # Encode all abstracts; FAISS expects float32 vectors
        encoded_data = self.model.encode(self.data['abstract'].values.tolist())
        encoded_data = np.asarray(encoded_data, dtype='float32')
        # Exact inner-product index, wrapped so that results carry our own row ids.
        # It is stored as `faiss_index` so that the attribute does not overwrite this method.
        self.faiss_index = faiss.IndexIDMap(faiss.IndexFlatIP(encoded_data.shape[1]))
        self.faiss_index.add_with_ids(encoded_data, np.arange(len(self.data), dtype='int64'))
        faiss.write_index(self.faiss_index, 'test.index')
        
        
    def fetch(self,idx):
        info = self.data.iloc[idx]
        meta_dict = {}
        meta_dict['abstract'] = info['abstract']
        return meta_dict

    def search(self,query, top_k):
        t=time.time()
        query_vector = np.asarray(self.model.encode([query]), dtype='float32')
        scores, ids = self.faiss_index.search(query_vector, top_k)
        print('>>>> Results in Total Time: {}'.format(time.time()-t))
        # Keep the ranking order returned by FAISS (np.unique would sort by id) and drop padding ids (-1)
        top_k_ids = [int(idx) for idx in ids[0] if idx != -1]
        results =  [self.fetch(idx) for idx in top_k_ids]
        return results

In [1]:
faiss_ix=faiss_index(data,embedder)
faiss_ix.index()
results=faiss_ix.search(query, 5)

print("\n")
for result in results:
    print('\t',result)
    
# Drop the reference first, then let the garbage collector reclaim the memory
del faiss_ix
gc.collect()

# <font color='brown' size=4>4.2 Bi-encoder + Cross encoder with FAISS search</font>

For complex search tasks, the search can significantly be improved by using <b>Retrieve & Re-Rank</b>.

<b>Retrieve & Re-Rank Pipeline:</b>

<img src='https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png' width=1000>
<div align="center"><font size="3">Source: Sbert</font></div>

<p>Given a search query, we first use a retrieval system that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with ElasticSearch, or we can use dense retrieval with a bi-encoder.

However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a re-ranker based on a cross-encoder that scores the relevancy of all candidates for the given search query.

The output will be a ranked list of hits we can present to the user.</p>
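
The two-stage idea can be sketched without any model at all. In the toy example below, word overlap plays the role of the fast retriever and overlap normalised by document length plays the role of the slower, more precise re-ranker; both scorers and the tiny corpus are stand-ins invented for illustration.

```python
def retrieve_and_rerank(query, corpus, retrieve_score, rerank_score, n_candidates=100, top_k=3):
    """Two-stage search: a cheap scorer shortlists, an expensive scorer orders the shortlist."""
    # Stage 1: score every document with the cheap function and keep the best candidates.
    shortlist = sorted(range(len(corpus)),
                       key=lambda i: retrieve_score(query, corpus[i]),
                       reverse=True)[:n_candidates]
    # Stage 2: re-score only the shortlisted documents with the expensive function.
    reranked = sorted(shortlist,
                      key=lambda i: rerank_score(query, corpus[i]),
                      reverse=True)
    return reranked[:top_k]

def overlap(q, d):
    """Number of words shared by the query and the document (toy retriever)."""
    return len(set(q.lower().split()) & set(d.lower().split()))

def overlap_ratio(q, d):
    """Shared words divided by document length (toy re-ranker)."""
    return overlap(q, d) / len(d.split())

corpus = [
    "influenza symptoms include fever cough and fatigue in most adult patients",
    "influenza symptoms",
    "weather report for tomorrow",
]
ranking = retrieve_and_rerank("influenza symptoms", corpus, overlap, overlap_ratio,
                              n_candidates=2, top_k=2)
print(ranking)  # [1, 0]
```

The expensive scorer only ever sees the shortlist, which is what keeps a cross-encoder affordable on a large corpus.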

In [1]:
#We use the Bi-Encoder to encode the query, so that we can use it with semantic search
bi_encoder = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
top_k = 3     #Number of passages we want to retrieve with the bi-encoder (increase it to give the re-ranker more candidates)

#The bi-encoder will retrieve top_k documents. We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')

passages=data['abstract'].values.tolist()

#If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))


def search(query,index,k=top_k):
    print("Input question:", query)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    t=time.time()
    query_vector = np.asarray(bi_encoder.encode([query]), dtype='float32')
    scores, ids = index.search(query_vector, k)
    # Keep the FAISS ranking order (np.unique would sort by id) and drop padding ids (-1)
    top_k_ids = [int(idx) for idx in ids[0] if idx != -1]
    print('>>>> Retrieval time: {}'.format(time.time()-t))

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    t=time.time()
    bienc_op=[passages[hit] for hit in top_k_ids]
    cross_inp = [[query, passage] for passage in bienc_op]
    cross_scores = cross_encoder.predict(cross_inp)
    print('>>>> Re-ranking time: {}'.format(time.time()-t))

    # Output of top-k hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-{} Bi-Encoder Retrieval hits".format(k))
    for result in bienc_op:
        print("\t{}".format(result.replace("\n", " ")))
        
    # Output of top-k hits from re-ranker, best cross-encoder score first
    print("\n-------------------------\n")
    print("Top-{} Cross-Encoder Re-ranker hits".format(k))
    for hit in np.argsort(np.array(cross_scores))[::-1]:
        print("\t{}".format(bienc_op[hit].replace("\n", " ")))


In [1]:
query='what are the symptoms of influenza virus?'
index=faiss.read_index('test.index')
search(query,index)

<p>Voila, we seem to have a slight improvement when re-ranked</p>

In [1]:
# Drop the references first, then let the garbage collector reclaim the memory
del index,cross_encoder,bi_encoder,passages
gc.collect()

# <font color='brown' size=4>4.3 Finetuning Bi-encoder unsupervised way</font>

<p>We could easily have fine-tuned a sentence-transformer model on our dataset if we had queries paired with relevant passages. But you would not have this data when building something from scratch. (We are not dealing with pre-training of transformer models here: it is expensive and requires a huge amount of data, especially for a specialised domain like this one.)</p>

<p>But can we devise an unsupervised approach to fine-tune our model on our dataset?</p>

<b>Synthetic Query Generation:</b>

   <p>We use synthetic query generation to achieve our goal. We start with the passages from our document collection and generate, for each of them, possible queries that users might ask or search for.</p>
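
The bookkeeping behind this is small: if the generator returns a fixed number of consecutive outputs per input passage, integer division maps each generated query back to its source. A minimal sketch, with hand-written passages and queries standing in for real model output and a hypothetical helper name:

```python
def pair_queries_with_passages(passages, generated_queries, num_queries):
    """Map a flat list of generated queries back to the passages that produced them.

    Assumes the generator returns num_queries consecutive outputs per input passage.
    """
    if len(generated_queries) != len(passages) * num_queries:
        raise ValueError("expected num_queries outputs per passage")
    return [(query, passages[i // num_queries])
            for i, query in enumerate(generated_queries)]

passages = ["Passage about vaccine trials.", "Passage about mask usage."]
queries = ["are vaccines tested", "how are vaccine trials run",
           "do masks work", "when to wear a mask"]

pairs = pair_queries_with_passages(passages, queries, num_queries=2)
for query, passage in pairs:
    print(query, "->", passage)
```

Each (query, passage) pair then serves as a positive training example for the bi-encoder.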

The paper <b>BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models</b> presented a method to learn (or adapt) a model for asymmetric semantic search without requiring labelled training data.


Let's fine-tune for a few epochs on our dataset based on the above idea.

The paragraphs fed to this code are chunks of abstracts, and each chunk gets at most 5 synthetically generated queries. In other words, we represent the information present in a paragraph as questions, then use these (query, paragraph) tuples to fine-tune an SBERT model so that it captures the semantic and syntactic mapping between them.
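
The training cell further down uses MultipleNegativesRankingLoss on these tuples. The NumPy sketch below illustrates the idea behind it, not the library's implementation: within a batch, each query should score its own paragraph higher than every other paragraph. The scale factor of 20 and the cosine scoring are assumptions taken to match the library's documented defaults, and the one-hot embeddings are toy data.

```python
import numpy as np

def multiple_negatives_ranking_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over in-batch similarities; the positive for query i is passage i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                        # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

aligned = np.eye(3)                         # each query points at its own passage
shuffled = np.roll(np.eye(3), 1, axis=0)    # each query points at someone else's passage

good = multiple_negatives_ranking_loss(aligned, aligned)
bad = multiple_negatives_ranking_loss(aligned, shuffled)
print(good, bad)
```

Because the other paragraphs in the batch act as negatives, duplicates inside a batch would be counted as wrong answers, which is why the training cell uses a data loader that avoids them.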

In [1]:
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-base-v1') #change to large and generate synthetic data fully
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-base-v1')
model.eval()

In [1]:
#Select the device (fall back to the CPU when no GPU is available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

In [1]:
# Parameters for generation
batch_size = 16 #Batch size
num_queries = 5 #Number of queries to generate for every paragraph
max_length_paragraph = 512 #Max length for paragraph
max_length_query = 64   #Max length for output query

def _removeNonAscii(s): return "".join(i for i in s if ord(i) < 128)

paragraphs=data['abstract'].values.tolist()[:1000] ## as a sample, we generate synthetic queries for the first 1000 abstracts only

with open('generated_queries_all.tsv', 'w') as fOut:
    for start_idx in tqdm(range(0, len(paragraphs), batch_size)):
        sub_paragraphs = paragraphs[start_idx:start_idx+batch_size]
        # Tokenize the batch by calling the tokenizer directly (prepare_seq2seq_batch is deprecated)
        inputs = tokenizer(sub_paragraphs, max_length=max_length_paragraph, truncation=True, padding=True, return_tensors='pt').to(device)
        # No gradients are needed for generation
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length_query,
                do_sample=True,
                top_p=0.95,
                num_return_sequences=num_queries)

        # generate() returns num_queries consecutive outputs for each input paragraph
        for idx, out in enumerate(outputs):
            query = tokenizer.decode(out, skip_special_tokens=True)
            query = _removeNonAscii(query)
            para = sub_paragraphs[idx // num_queries]
            para = _removeNonAscii(para)
            fOut.write("{}\t{}\n".format(query.replace("\t", " ").strip(), para.replace("\t", " ").strip()))

In [1]:
# Drop the references first, then let the garbage collector reclaim the memory
del model,tokenizer,paragraphs
gc.collect()

Let's look at a sample query and passage generated above (the last pair from the loop):

In [1]:
query,para

In [1]:
train_examples = [] 
with open('generated_queries_all.tsv') as fIn:
    for line in fIn:
        try:
            query, paragraph = line.strip().split('\t', maxsplit=1)
        except ValueError:
            # Skip malformed lines that do not contain both a query and a paragraph
            continue
        train_examples.append(InputExample(texts=[query, paragraph]))
        
random.shuffle(train_examples)

# For the MultipleNegativesRankingLoss, it is important
# that the batch does not contain duplicate entries, i.e.
# no two equal queries and no two equal paragraphs.
# To ensure this, we use a special data loader
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)

# Now we assemble a SentenceTransformer from the pretrained checkpoint plus a pooling layer
word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])


# MultipleNegativesRankingLoss requires input pairs (query, relevant_passage)
# and trains the model so that it is suitable for semantic search
train_loss = losses.MultipleNegativesRankingLoss(model)


#Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)

os.makedirs('search', exist_ok=True)
model.save('search/search-model')


In [1]:
# Drop the references first, then let the garbage collector reclaim the memory
del word_emb,pooling
gc.collect()

<b>Re-indexing</b>

In [1]:
faiss_ix=faiss_index(data,model)
faiss_ix.index()

# Drop the references first, then let the garbage collector reclaim the memory
del faiss_ix,model
gc.collect()

# <font color='brown' size=4>4.3.1 Bi-encoder + Cross encoder with FAISS search (finetuned)</font>

In [1]:
#We use the fine-tuned Bi-Encoder to encode the query, so that we can use it with semantic search
model_name = 'search/search-model'
bi_encoder = SentenceTransformer(model_name)
top_k = 3     #Number of passages we want to retrieve with the bi-encoder (increase it to give the re-ranker more candidates)

#The bi-encoder will retrieve top_k documents. We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')

passages=data['abstract'].values.tolist()

#If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))


def search(query,index,k=top_k):
    print("Input question:", query)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    t=time.time()
    query_vector = np.asarray(bi_encoder.encode([query]), dtype='float32')
    scores, ids = index.search(query_vector, k)
    # Keep the FAISS ranking order (np.unique would sort by id) and drop padding ids (-1)
    top_k_ids = [int(idx) for idx in ids[0] if idx != -1]
    print('>>>> Retrieval time: {}'.format(time.time()-t))

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    t=time.time()
    bienc_op=[passages[hit] for hit in top_k_ids]
    cross_inp = [[query, passage] for passage in bienc_op]
    cross_scores = cross_encoder.predict(cross_inp)
    print('>>>> Re-ranking time: {}'.format(time.time()-t))

    # Output of top-k hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-{} Bi-Encoder Retrieval hits".format(k))
    for result in bienc_op:
        print("\t{}".format(result.replace("\n", " ")))
    
    # Output of top-k hits from re-ranker, best cross-encoder score first
    print("\n-------------------------\n")
    print("Top-{} Cross-Encoder Re-ranker hits".format(k))
    for hit in np.argsort(np.array(cross_scores))[::-1]:
        print("\t{}".format(bienc_op[hit].replace("\n", " ")))


In [1]:
query='death rates of covid case'
index=faiss.read_index('test.index')
search(query,index)

# <font color='brown' size=4>4.4 Allen-ai SPECTER model</font>

SPECTER encodes paper titles and abstracts into a vector space; we then use util.semantic_search() to find the most similar papers. This is the same pipeline we followed above, but without FAISS indexing.

In [1]:
print(len(passages), "papers loaded")

#We then load the allenai-specter model with SentenceTransformers
model = SentenceTransformer('allenai-specter')

#To encode the papers, we must combine the title and the abstracts to a single string
paper_texts = [paper[1]['title'] + '[SEP]' + paper[1]['abstract'] for paper in data.iterrows()]

#Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True)


#We define a function, given title & abstract, searches our corpus for relevant (similar) papers
def search_papers(title, abstract):
  query_embedding = model.encode(title+'[SEP]'+abstract, convert_to_tensor=True)

  search_hits = util.semantic_search(query_embedding, corpus_embeddings)
  search_hits = search_hits[0]  #Get the hits for the first query

  print("\n\nPaper:", title)
  print("Most similar papers:")
  for hit in search_hits:
    related_paper = data.loc[hit['corpus_id']]
    print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['authors'], related_paper['affiliations']))

In [1]:
search_papers(title='Specializing Word Embeddings (for Parsing) by Information Bottleneck',
              abstract='Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.')


#### <b>Cheers, happy kaggling!!!</b>

# <font color='brown' size=4>5. Acknowledgements</font>

* https://medium.com/mlearning-ai/semantic-search-with-s-bert-is-all-you-need-951bc710e160
* https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
* https://www.sbert.net/examples/applications/semantic-search/README.html#background