## How to Build a Semantic Search Engine With Transformers and Faiss

https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8

![semanticsearh_bert_faiss.png](attachment:semanticsearh_bert_faiss.png)

In this tutorial, you will learn how to build a vector-based search engine with sentence transformers and Faiss. If you want to jump straight into the code, check out the GitHub repo and the Google Colab notebook.

In the second part of the tutorial, you will learn how to serve the search engine in a Streamlit application and deploy it with Docker and AWS Elastic Beanstalk.

Keyword-based search engines like Elasticsearch usually struggle with:
* Complex queries or words that have a dual meaning.
* Long queries, such as a paper abstract or a paragraph from a blog.
* Users who are unfamiliar with a field’s jargon or who would like to do exploratory search.

Vector-based (also called semantic) search engines tackle those pitfalls by finding a numerical representation of text queries using state-of-the-art language models, indexing them in a high-dimensional vector space and measuring how similar a query vector is to the indexed documents.

### Elasticsearch
* Elasticsearch represents every indexed document with a high-dimensional, weighted vector, where each distinct index term is a dimension and its value (or weight) is calculated with TF-IDF.
* To find relevant documents and rank them, Elasticsearch combines a Boolean Model (BM) with a Vector Space Model (VSM). BM marks which documents contain a user’s query and VSM scores how relevant they are. During search, the query is transformed to a vector using the same TF-IDF pipeline, and the VSM score of document d for query q is the cosine similarity of the weighted vectors V(q) and V(d).
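
To make the scoring idea concrete, here is a minimal sketch of TF-IDF weighting followed by cosine similarity, written in plain Python. It is not Lucene's or Elasticsearch's actual scoring code: the whitespace tokeniser, the smoothed IDF formula and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return the vocabulary, IDF weights and one TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF (an assumed variant; real engines use their own formula).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(query, vocab, idf, vectors):
    """Vectorise the query with the same TF-IDF weights and score every document."""
    tf = Counter(query.lower().split())
    q = [tf[t] * idf[t] for t in vocab]
    return [cosine(q, d) for d in vectors]


# Toy corpus, invented for this example.
docs = [
    "fake news spreads on social media",
    "vaccines and public health",
    "detecting fake news with machine learning",
]
vocab, idf, vectors = build_tfidf(docs)

scores = score("fake news detection", vocab, idf, vectors)
no_match = score("misinformation", vocab, idf, vectors)
```

In this sketch the query "misinformation" scores zero against every document because the term never appears in the corpus, even though two of the documents are about that topic. That is the kind of vocabulary mismatch a semantic search engine is meant to address.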

Computed naively, this way of measuring similarity is very simplistic and not scalable. The workhorse behind Elasticsearch is Lucene, which employs various tricks, from boosting fields to changing how vectors are normalised, to speed up the search and improve its quality.

### Tutorial: Building a vector-based search engine with Sentence Transformers and Faiss

In [2]:
## Importing libraries
import pandas as pd
import s3fs 
import torch
from sentence_transformers import SentenceTransformer

import faiss
import numpy as np
import pickle

In [3]:
df = pd.read_csv('s3://vector-search-blog/misinformation_papers.csv')

#### Vectorising documents with Sentence Transformers

In [4]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

In [5]:
# Check if CUDA is available and switch to GPU
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

cpu


#### Convert abstracts to vectors

In [7]:
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)

#### Indexing documents with Faiss

Faiss uses only 32-bit floating point matrices. This means we will have to change the data type of the input before building the index.

Here, we will use the IndexFlatL2 index, which performs a brute-force L2 distance search. It works well with our dataset; however, it can be very slow with a large dataset because search time scales linearly with the number of indexed vectors. Faiss offers faster indexes too!
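
As a rough illustration of what a brute-force L2 search does, the sketch below compares every query against every stored vector in NumPy. It is a stand-in for the idea, not Faiss code: the function name `flat_l2_search` and the random data are assumptions. Like `IndexFlatL2`, it reports squared L2 distances.

```python
import numpy as np


def flat_l2_search(index_vectors, queries, k):
    """Exhaustive k-nearest-neighbour search using squared L2 distance.

    Returns (distances, positions), each of shape (n_queries, k).
    """
    index_vectors = np.asarray(index_vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise differences, shape (n_queries, n_indexed, dim).
    diff = queries[:, None, :] - index_vectors[None, :, :]
    # Squared L2 distance of every query to every indexed vector.
    dist = (diff ** 2).sum(axis=2)
    # Sort each row and keep the k closest positions.
    order = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, order, axis=1), order


# Random vectors stand in for the abstract embeddings.
rng = np.random.default_rng(0)
toy_vectors = rng.random((100, 8), dtype=np.float32)

# Search with the first stored vector as the query.
toy_D, toy_I = flat_l2_search(toy_vectors, toy_vectors[:1], k=3)
```

Because the distance matrix has one entry per (query, indexed vector) pair, the work grows linearly with the size of the index, which is why this approach becomes slow on large collections.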

To create an index with the abstract vectors, we will:
* Change the data type of the abstract vectors to float32.
* Build an index and pass it the dimension of the vectors it will operate on.
* Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
* Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from Microsoft Academic Graph.
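
The ID mapping in the last two steps can be pictured with a small NumPy sketch. A plain flat index identifies vectors by the order in which they were added (0, 1, 2, ...); an ID map translates those positions into the IDs you supplied. The paper IDs and positions below are made up for illustration.

```python
import numpy as np

# Hypothetical paper IDs, in the same order as the vectors were added.
# Faiss works with 64-bit integer IDs, hence the explicit dtype.
paper_ids = np.array([3001, 3002, 3003, 3004], dtype="int64")

# Positions a plain flat index might return for one query with k=2.
positions = np.array([[2, 0]])

# What an ID-mapped index reports instead: the custom IDs at those positions.
mapped_ids = paper_ids[positions]
```

With the mapping in place, search results can be joined straight back to the `id` column of the DataFrame without tracking row order separately.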

In [8]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.id.values.astype("int64"))  # Faiss expects int64 IDs

# Retrieve the 10 nearest neighbours of the abstract at row 5415
# (the closest hit is that abstract itself, at distance 0)
D, I = index.search(np.array([embeddings[5415]]), k=10)

#### Searching with user queries

To retrieve academic articles for a new query, we would have to:
* Encode the query with the same sentence-DistilBERT model we used for the abstract vectors.
* Change its data type to float32.
* Search the index with the encoded query.

In [10]:
user_query = "WhatsApp was alleged to have been widely used to spread misinformation and propaganda during the 2018 elections in Brazil and the 2019 elections in India. Due to the private encrypted nature of the messages on WhatsApp, it is hard to track the dissemination of misinformation at scale. In this work, using public WhatsApp data from Brazil and India, we observe that misinformation has been largely shared on WhatsApp public groups even after they were already fact-checked by popular fact-checking agencies. This represents a significant portion of misinformation spread in both Brazil and India in the groups analyzed. We posit that such misinformation content could be prevented if WhatsApp had a means to flag already fact-checked content. To this end, we propose an architecture that could be implemented by WhatsApp to counter such misinformation. Our proposal respects the current end-to-end encryption architecture on WhatsApp, thus protecting users’ privacy while providing an approach to detect the misinformation that benefits from fact-checking efforts."

In [11]:
import numpy as np

def vector_search(query, model, index, num_results=10):
    """Transforms a query to a vector using a pretrained, sentence-level
    DistilBERT model and finds similar vectors using Faiss.
    
    Args:
        query (str): User query that should be more than a sentence long.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (faiss.Index): Faiss index to search.
        num_results (int): Number of results to return.
    
    Returns:
        D (:obj:`numpy.array` of `float`): Distance between results and query.
        I (:obj:`numpy.array` of `int`): Paper ID of the results.
    
    """
    # Wrap the query in a list so it is encoded as a single text;
    # list(query) would split a string into individual characters.
    vector = model.encode([query])
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I


def id2details(df, I, column):
    """Returns the values of `column` (e.g. the paper title) for each paper ID in I."""
    return [list(df[df.id == idx][column]) for idx in I[0]]

# Querying the index
D, I = vector_search(user_query, model, index, num_results=10)
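
The arrays returned by `vector_search` have one row per query, so the results for a single query sit in `D[0]` and `I[0]`. Below is a small sketch of turning them into readable rows. The helper name `pair_results`, the example arrays and the title lookup are hypothetical; the `-1` check covers the case where Faiss has fewer than `k` vectors to return and pads the ID array.

```python
import numpy as np


def pair_results(D, I, id_to_title):
    """Pair each returned paper ID with its distance and title (first query only)."""
    results = []
    for dist, paper_id in zip(D[0], I[0]):
        if paper_id == -1:  # padding: fewer than k results were available
            continue
        title = id_to_title.get(int(paper_id), "<unknown>")
        results.append((int(paper_id), float(dist), title))
    return results


# Made-up search output and titles, shaped like the real D and I.
example_D = np.array([[0.0, 12.5, 40.25]], dtype="float32")
example_I = np.array([[101, 205, -1]], dtype="int64")
example_titles = {101: "Paper A", 205: "Paper B"}

rows = pair_results(example_D, example_I, example_titles)
```

Smaller distances mean closer matches, so the rows come back already ordered from most to least similar.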

In [6]:
print(len(df))

8430
