### Theory

**Semantic Search**

- Semantic search is an advanced form of search that goes beyond traditional keyword matching and can be applied to large datasets. It interprets the contextual meaning of the search query and returns results from a corpus that are semantically relevant, even when the query and the results share no exact keywords.
- Semantic search uses concepts of machine learning and natural language processing to understand the meaning of the text, identify relationships in natural language, and extract the intent of the search query.
- Natural language processing is a branch of computer science that deals with the interaction between computers and human (natural) language.

**Benefits of Semantic Search**
- First, it can help to improve the accuracy of search results. By understanding the meaning of the search query, semantic search can return results that are more relevant to the user’s intent.
- Second, semantic search can help to improve the user experience. By providing more relevant results, semantic search can make it easier for users to find the information they are looking for.

**Cosine similarity**
- In this blog, we use cosine similarity to perform the semantic search. Cosine similarity measures how similar two vectors are by the angle between them. It is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes (Euclidean norms), which gives a value between -1 and 1.
- Cosine similarity is a good measure of similarity for text vectors because it depends only on the direction of the vectors, not on their magnitude. A short text and a long text about the same topic can therefore still score as highly similar.

![image.png](attachment:image.png)
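
As a minimal sketch of the definition above, the calculation can be written in a few lines of NumPy. The function name `cosine_similarity` and the example vectors are illustrative only, and returning 0 for a zero vector is a convention assumed here, not something the formula prescribes.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a (dot product is 0)

print(cosine_similarity(a, b))   # approximately 1: magnitude does not matter
print(cosine_similarity(a, c))   # approximately 0: no shared direction
```

Because `b` is just a scaled copy of `a`, the two score as (numerically) identical, which is the magnitude-independence property that makes the measure useful for text of different lengths.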

**Challenges of Cosine Similarity on Large Datasets:**
- Even though cosine similarity is one of the most common ways to measure the similarity between two vectors, it can become computationally expensive when many vectors are compared at once. For example, suppose you have two matrices of vectors, A and B, where A has shape (100000, 768) and B has shape (100000, 768). Comparing every row of A with every row of B requires the product A · Bᵀ, which has shape (100000, 100000). That is 10 billion values, or roughly 40 GB in 32-bit floats. Matrices of this size can exceed the available memory and lead to memory errors.
- High-dimensional sparse representations, such as TF-IDF vectors over a large vocabulary, are especially prone to this issue once they are converted to dense arrays.
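
The arithmetic behind that example can be checked with a short back-of-the-envelope script. The helper name `similarity_matrix_bytes` is made up for this sketch, and 32-bit floats are an assumption; with 64-bit floats the figures double.

```python
import numpy as np


def similarity_matrix_bytes(n_rows_a: int, n_rows_b: int, dtype=np.float32) -> int:
    """Bytes needed to hold a dense (n_rows_a, n_rows_b) similarity matrix."""
    return n_rows_a * n_rows_b * np.dtype(dtype).itemsize


n, dim = 100_000, 768
inputs_gb = 2 * n * dim * np.dtype(np.float32).itemsize / 1e9
output_gb = similarity_matrix_bytes(n, n) / 1e9

print(f"two input matrices: {inputs_gb:.2f} GB")   # two input matrices: 0.61 GB
print(f"similarity matrix:  {output_gb:.2f} GB")   # similarity matrix:  40.00 GB
```

The inputs are small; it is the all-pairs output that grows quadratically with the number of rows and dominates memory.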

**To summarize, the challenges of performing cosine similarity on large datasets are:**
- Memory Issues: Calculating cosine similarity might result in huge matrices and thus require huge memory for calculation.
- Speed: Computing cosine similarity between many high-dimensional vectors is slow, which makes a brute-force comparison unsuitable for use cases that need fast responses.
- Accuracy: The usefulness of cosine similarity scores can be affected by the number of dimensions in the vectors. In very high-dimensional spaces, similarity scores tend to become less discriminative, so relevant and irrelevant items are harder to tell apart.

**How to solve this problem?**
- There are several ways to overcome the challenges of calculating cosine similarity on large datasets. One way is to use a distributed computing framework, such as Hadoop or Spark. These frameworks can be used to distribute the calculation of the dot product and the product of the lengths of the vectors across multiple machines. This can significantly reduce the time it takes to calculate cosine similarity on large datasets.
- You can also use services such as Elasticsearch and OpenSearch to speed up semantic search.
- Another way to overcome the challenges of calculating cosine similarity on large datasets is to use a vector approximation technique, such as dimensionality reduction or quantization. These techniques reduce the size of the vectors without significantly affecting the accuracy of the cosine similarity calculation, which makes it possible to calculate cosine similarity on large datasets in a reasonable amount of time.
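
Before reaching for a cluster, the memory problem on its own can often be eased on a single machine by never materializing the full similarity matrix. The sketch below is not one of the techniques listed above; it is a simpler, exact alternative that processes the queries in chunks and keeps only the top `k` matches per query. The names `normalize_rows` and `top_k_cosine_chunked`, and the default `chunk_size`, are assumptions made for illustration.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # clip avoids division by zero


def top_k_cosine_chunked(queries, corpus, k=5, chunk_size=1024):
    """Return (indices, scores) of the k most similar corpus rows per query.

    Only a (chunk_size, n_corpus) block of similarities exists at any time.
    """
    corpus_n = normalize_rows(corpus)
    k = min(k, corpus.shape[0])
    all_idx = np.empty((queries.shape[0], k), dtype=np.int64)
    all_scores = np.empty((queries.shape[0], k), dtype=np.float64)

    for start in range(0, queries.shape[0], chunk_size):
        stop = start + chunk_size
        block = normalize_rows(queries[start:stop])
        sims = block @ corpus_n.T                  # (chunk, n_corpus)
        idx = np.argsort(-sims, axis=1)[:, :k]     # best k columns per row
        all_idx[start:stop] = idx
        all_scores[start:stop] = np.take_along_axis(sims, idx, axis=1)
    return all_idx, all_scores


# Small random example; real embeddings would replace these arrays.
rng = np.random.default_rng(0)
queries = rng.normal(size=(10, 8))
corpus = rng.normal(size=(50, 8))
indices, scores = top_k_cosine_chunked(queries, corpus, k=3, chunk_size=4)
print(indices.shape, scores.shape)
```

With the earlier (100000, 768) example and `chunk_size=1024`, each block holds about 1024 × 100000 values instead of 100000 × 100000. The total amount of computation is unchanged, so this addresses memory, not speed.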

### Current Application

- Here we combine BM25 and cosine similarity to speed up semantic search, following the functional architecture below.

![image.png](attachment:image.png)

**Understanding the above architecture:**
- A user sends a query to search a corpus. BM25 scores every document in the corpus against the query and ranks them. The top 50 documents are then passed to cosine similarity, which compares them with the user query to produce the final re-ranking. This leverages BM25 for speed and cosine similarity for accuracy.
- BM25 is a TF-IDF-based relevance algorithm that finds the documents in a corpus most relevant to a user query. Its benefit is that it is very fast; its downside is that it does not understand the semantics of natural language. Combining BM25 with cosine similarity gives the best of both worlds.
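
The two-stage flow described above can be sketched end to end as follows. This is an illustration under stated assumptions, not the implementation used later in this post: `bm25_scores` is a small hand-written Okapi BM25 with commonly used defaults (`k1=1.5`, `b=0.75`), `hybrid_search` and its parameters are hypothetical names, and `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, so it captures no actual semantics.

```python
import math
from collections import Counter

import numpy as np


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = np.zeros(n_docs)
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            scores[i] += idf * tf[term] * (k1 + 1) / denom
    return scores


def hybrid_search(query, docs, embed, top_n=50, top_k=5):
    """Stage 1: BM25 keeps top_n candidates. Stage 2: cosine re-ranks them."""
    lexical = bm25_scores(query, docs)
    candidates = np.argsort(-lexical)[:top_n]
    vecs = embed([query] + [docs[i] for i in candidates])
    vecs = vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12, None)
    sims = vecs[1:] @ vecs[0]                  # cosine of each candidate vs query
    order = np.argsort(-sims)[:top_k]
    return [(int(candidates[j]), float(sims[j])) for j in order]


def toy_embed(texts):
    """Placeholder embedding (word counts); swap in a real model in practice."""
    vocab = sorted({w for t in texts for w in tokenize(t)})
    index = {w: i for i, w in enumerate(vocab)}
    out = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for w in tokenize(text):
            out[row, index[w]] += 1
    return out


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "a cat and a dog became friends",
]
for doc_id, score in hybrid_search("cat on a mat", docs, toy_embed, top_n=3, top_k=2):
    print(doc_id, round(score, 3), docs[doc_id])
```

The expensive embedding and cosine step only ever sees `top_n` documents, regardless of corpus size, which is where the speed-up comes from. The trade-off is that a relevant document sharing no keywords with the query can be filtered out by BM25 before the semantic stage sees it.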

### Semantic Search Algorithm

#### Data Preprocessing