<a href="https://www.kaggle.com/code/mahmoudhamza/nlp-semantic-search?scriptVersionId=106931277" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Aim 
>I am building knowledge to help me solve this [project](https://github.com/santarabantoosoo/Semantic_search_on_movies_scripts)

### What do we need, in simple terms?

> Search a large database of long documents with a sentence or a long paragraph, and get the closest match

How to do this? 

1- Convert your query and your sentences into vectors. The image below visualizes them.
[![L1OXVa.md.png](https://iili.io/L1OXVa.md.png)](https://freeimage.host/i/L1OXVa)  
You can then measure the similarity between these vectors (for example, with cosine similarity).

You can even ask a question and check which of the documents is the closest match to it. I think this differs from question answering in that it has to rank documents.
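A minimal sketch of the idea above, assuming the vectors already exist. The 3-dimensional "embeddings" here are made up for illustration; real ones come from a model and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy vectors (hypothetical values, one row per document).
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.1, 0.0])

scores = cosine_similarity(query, docs)
ranking = np.argsort(-scores)  # document indices, best match first
print(ranking, scores[ranking])
```

Ranking is just sorting the documents by their similarity score, which is what separates this from returning a single answer.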

[![L1eMdl.md.png](https://iili.io/L1eMdl.md.png)](https://freeimage.host/i/L1eMdl)


## Motivation
> Why build a vector-based search engine?

Keyword-based search engines are easy to use and work well in most scenarios. You ask for machine learning papers and they return a bunch of results containing an exact match or a close variation of the query, like machine-learning. Some of them might even return results containing synonyms of the query or words that appear in a similar context. Others, like Elasticsearch, do all of these things, and more, in a rapid and scalable way. However, keyword-based search engines usually struggle with:

- Complex queries or words that have a dual meaning.
- Long queries such as a paper abstract or a paragraph from a blog.
- Users who are unfamiliar with a field’s jargon or users who would like to do exploratory search.

## Resources used 
[How to Build a Semantic Search Engine With Transformers and Faiss | by Kostas Stathoulopoulos | Towards Data Science](https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8) with the corresponding [colab](https://colab.research.google.com/drive/11WBCrwNzbNWN7QbMEwzy-8MZROOVQFnZ?usp=sharing#scrollTo=AtSC6oDDjtMA)

[Semantic search with FAISS - Hugging Face Course](https://huggingface.co/course/chapter5/6?fw=tf)

## Resources to check

[Building a Playlist Generator with Sentence Transformers](https://huggingface.co/blog/playlist-generator)

[Deploy a machine learning application with Streamlit and Docker on AWS | by Kostas Stathoulopoulos | Towards Data Science](https://towardsdatascience.com/how-to-deploy-a-semantic-search-engine-with-streamlit-and-docker-on-aws-elastic-beanstalk-42ddce0422f3)

[NLP — Efficient Semantic Similarity Search with Faiss (Facebook AI Similarity Search) and GPUs | by Gabriel Tardochi Salles | Lett | Medium](https://medium.com/lett-digital/nlp-efficient-semantic-similarity-search-with-faiss-facebook-ai-similarity-search-and-gpus-274771d0709a)


[sentence-transformers (Sentence Transformers)](https://huggingface.co/sentence-transformers)

[Detecting explicit lyrics: a case study in Italian music | SpringerLink](https://link.springer.com/article/10.1007/s10579-022-09595-3)

[Home · facebookresearch/faiss Wiki](https://github.com/facebookresearch/faiss/wiki)

[youtube playlist - vector similarity search and FAISS course](https://www.youtube.com/watch?v=AY62z7HrghY&list=PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc&index=2)

## Things to focus on 

> [variants of sentence transformers](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0)

## Missing concepts  
- metrics 
- a more appealing use case (novelty of idea) 
- use a domain-specific transformer like SciBERT which has been trained on papers from the corpus of semanticscholar.org. 
- remove duplicates before returning the results 
- experiment with other indexes.

### Overview 
Both keyword-based and vector-based search engines need to:
- index documents (i.e. store them in an easily retrievable form)
- vectorise text 
- measure how relevant a document is to a query.

[![L0bXHP.md.png](https://iili.io/L0bXHP.md.png)](https://freeimage.host/i/L0bXHP)

#### 1- Keyword-based search engines

- Inverted index, as shown above 
- TF-IDF for vectorizing documents and queries

- Boolean Model (BM) with a Vector Space Model (VSM). BM marks which documents contain a user’s query and VSM scores how relevant they are. Cosine similarity of the VSM model determines the relevant documents.  

NOTE: This way of measuring similarity is very simplistic and not scalable. The workhorse behind Elasticsearch is Lucene which employs various tricks, from boosting fields to changing how vectors are normalised, to speed up the search and improve its quality.
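To make the pieces above concrete, here is a toy sketch in plain Python. The three documents, the whitespace tokenizer, and the smoothed IDF formula are my own assumptions for illustration; this is not how Lucene or Elasticsearch implement it.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus: document id -> text.
docs = {
    0: "machine learning for fake news detection",
    1: "deep learning models for image search",
    2: "fake news spreads on social media",
}

def tokenize(text):
    return text.lower().split()

# Inverted index: term -> set of ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted[term].add(doc_id)

# Smoothed inverse document frequency (one common variant).
n_docs = len(docs)
idf = {t: math.log((1 + n_docs) / (1 + len(ids))) + 1.0
       for t, ids in inverted.items()}

def tfidf(text):
    """Sparse TF-IDF vector as a dict; unknown terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query):
    q_terms = [t for t in tokenize(query) if t in inverted]
    # Boolean step (OR): keep documents containing at least one query term.
    candidates = set().union(*(inverted[t] for t in q_terms)) if q_terms else set()
    # Vector space step: score the candidates by cosine similarity.
    q_vec = tfidf(query)
    scored = [(d, cosine(q_vec, tfidf(docs[d]))) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

results = search("fake news detection")
print(results)
```

The inverted index only narrows down the candidates; the TF-IDF cosine score decides their order.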

#### 2- Vector-based search engines

To build our semantic search engine, we will use Sentence Transformers, which fine-tune BERT-based models to produce semantically meaningful embeddings of long text sequences.

How to measure similarity between query and documents? 

1- Cosine similarity between the query and every document ==> very slow

2- Faiss, a library for efficient similarity search and clustering of dense vectors. Faiss offers a large collection of indexes and composite indexes. Moreover, given a GPU, Faiss scales up to billions of vectors!

## Tutorial: Building a vector-based search engine with Sentence Transformers and Faiss

The dataset contains 8,430 academic articles on misinformation, disinformation and fake news.

[![L15C3x.md.png](https://iili.io/L15C3x.md.png)](https://freeimage.host/i/L15C3x)

### Vectorising documents with Sentence Transformers

We will use the distilbert-base-nli-stsb-mean-tokens model, which performs well on Semantic Textual Similarity tasks and is considerably smaller, and therefore faster, than BERT.

Steps:

- Instantiate the transformer
- Switch to a GPU if it is available.
- Use the `.encode()` method to vectorise all the paper abstracts.
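The steps above could look roughly like this. It is a sketch, not the tutorial's exact code: `load_model` and `embed_abstracts` are helper names I made up, and the heavy imports are kept inside `load_model` so the rest can be tried without downloading anything.

```python
import numpy as np

MODEL_NAME = "distilbert-base-nli-stsb-mean-tokens"

def load_model(name=MODEL_NAME):
    """Instantiate the transformer, on a GPU if one is available."""
    # Lazy imports: torch and sentence-transformers are only needed here.
    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return SentenceTransformer(name, device=device)

def embed_abstracts(model, abstracts):
    """Vectorise a collection of abstracts; returns one row per abstract."""
    embeddings = model.encode(list(abstracts), show_progress_bar=False)
    return np.asarray(embeddings)

# Usage (downloads the model on first run):
# model = load_model()
# embeddings = embed_abstracts(model, ["First abstract ...", "Second abstract ..."])
# print(embeddings.shape)
```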

### Indexing documents with Faiss

Faiss is built around the Index object which contains, and sometimes preprocesses, the searchable vectors. It handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s.

Faiss uses only 32-bit floating point matrices. This means we will have to change the data type of the input before building the index.

Different indices can be used for distance search. Here, we will use the IndexFlatL2 index, which performs a brute-force L2 distance search. For faster search, check the Faiss wiki:  
[FAISS offers faster search](https://github.com/facebookresearch/faiss/wiki/Faster-search)
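To see what "brute-force L2 search" means, here is a NumPy-only sketch of the computation. It is not Faiss code, just my own illustration of the same idea: compare the query against every stored vector and keep the `k` smallest squared L2 distances. The function name and toy vectors are made up.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exhaustive search: squared L2 distance from each query to every vector."""
    vectors = np.asarray(vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Shape (n_queries, n_vectors, d) -> (n_queries, n_vectors)
    diff = queries[:, None, :] - vectors[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Positions of the k closest vectors for each query, closest first.
    order = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, order, axis=1), order

# Toy 2-dimensional vectors (hypothetical values).
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
dists, ids = flat_l2_search(stored, [[0.9, 0.1]], k=2)
print(ids, dists)
```

The cost grows linearly with the number of stored vectors, which is why the approximate indexes in the link above exist.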

To create an index with the abstract vectors, we will:

- Change the data type of the abstract vectors to float32 (FAISS requires this).
- Build an index and pass it the dimension of the vectors it will operate on.
- Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
- Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from Microsoft Academic Graph.
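A sketch of these four steps, assuming `faiss` is installed (for example via the `faiss-cpu` package). The helper names `to_faiss_matrix` and `build_index` are mine, and the IDs in the usage comment are placeholders rather than real Microsoft Academic Graph IDs.

```python
import numpy as np

def to_faiss_matrix(embeddings):
    """Step 1: Faiss expects a C-contiguous float32 matrix."""
    return np.ascontiguousarray(embeddings, dtype="float32")

def build_index(embeddings, paper_ids):
    import faiss  # lazy import, only needed when an index is actually built

    vectors = to_faiss_matrix(embeddings)
    # Step 2: a flat (brute-force) L2 index for vectors of this dimension.
    index = faiss.IndexFlatL2(vectors.shape[1])
    # Step 3: wrap it so search results carry our own IDs.
    index = faiss.IndexIDMap(index)
    # Step 4: add the vectors together with their 64-bit integer IDs.
    index.add_with_ids(vectors, np.asarray(paper_ids, dtype="int64"))
    return index

# Usage with placeholder IDs:
# index = build_index(embeddings, [101, 102, 103])
# print(index.ntotal)  # number of vectors stored in the index
```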


[![L1GccF.md.png](https://iili.io/L1GccF.md.png)](https://freeimage.host/i/L1GccF)

Is there an easier way to use FAISS? 

Check this example from the Hugging Face course, chapter 5: 

[![L1eD2R.md.png](https://iili.io/L1eD2R.md.png)](https://freeimage.host/i/L1eD2R)

However, the course does not use the sentence-transformers library; it pools the model outputs manually, which I think works worse. I think we need to check the HF book for a sentence transformers implementation with FAISS.

### Searching with user queries

To retrieve academic articles for a new query, we would have to:

- Encode the query with the same sentence-DistilBERT model we used for the abstract vectors.
- Change its data type to float32.
- Search the index with the encoded query.
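Put together, the three steps might look like the sketch below. `search_papers` is a name I made up; it assumes `model` has an `.encode()` method like a Sentence Transformers model and `index` has a Faiss-style `.search(vectors, k)` method that returns distances and IDs.

```python
import numpy as np

def search_papers(index, model, query, k=5):
    """Return up to k (paper_id, distance) pairs, closest first."""
    # Encode the query with the same model used for the abstracts,
    # and cast to float32 as Faiss requires.
    query_vec = np.asarray(model.encode([query]), dtype="float32")
    # Faiss returns two arrays of shape (n_queries, k): distances and IDs.
    distances, ids = index.search(query_vec, k)
    return list(zip(ids[0].tolist(), distances[0].tolist()))

# Usage, assuming `index` and `model` were built as in the earlier steps:
# for paper_id, distance in search_papers(index, model, "fake news on social media"):
#     print(paper_id, distance)
```

With an IndexFlatL2 index, smaller distances mean closer matches, so the first pair is the best hit.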