### Theory

**Semantic Search**

- Semantic search goes beyond traditional keyword matching, which makes it useful on large datasets. It interprets the contextual meaning of the search query and returns results from a corpus that are semantically relevant, even when the query and the results do not share the exact keywords.
- Semantic search uses machine learning and natural language processing (NLP) to understand the meaning of text, identify relationships in natural language, and extract the intent of the search query.
- Natural language processing is a branch of computer science that deals with the interaction between computers and human language.

**Benefits of Semantic Search**
- First, it can help to improve the accuracy of search results. By understanding the meaning of the search query, semantic search can return results that are more relevant to the user’s intent.
- Second, semantic search can help to improve the user experience. By providing more relevant results, semantic search can make it easier for users to find the information they are looking for.

**Cosine similarity**
- In this blog, we use cosine similarity to perform the semantic search. Cosine similarity measures how similar two vectors are by the angle between them. It is calculated by taking the dot product of the two vectors and dividing it by the product of their lengths (norms).
- Cosine similarity is a good measure of similarity for text vectors because it depends only on the direction of the vectors, not on their magnitude.

![image.png](attachment:image.png)
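As a minimal sketch of the definition above, the following computes cosine similarity in plain Python and shows that scaling a vector does not change the score. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, -1.0, 0.0]  # a different direction

same_direction = cosine_similarity(a, b)
different_direction = cosine_similarity(a, c)
print(same_direction, different_direction)
```

Vectors pointing the same way score close to 1 regardless of their length, while vectors pointing in different directions score lower.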

**Challenges of Cosine Similarity on Large Datasets:**
- Although cosine similarity is one of the most common ways to measure the similarity between two vectors, it can become computationally expensive, especially when the inputs are very large. For example, suppose you have two embedding matrices A and B, each of size (100000, 768). The pairwise product A·Bᵀ has a size of (100000, 100000), which is huge. Such matrices can exceed the available memory and lead to out-of-memory errors.
- Sparse matrices (for example, bag-of-words representations) are also prone to such issues, since they tend to have a very large number of dimensions.
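To put rough numbers on the example above, here is a small back-of-the-envelope calculation. It assumes dense `float32` values (4 bytes each), which is an assumption about storage rather than something stated in the text. The helper name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(rows, cols, bytes_per_value=4):
    """Memory needed for a dense rows x cols matrix, in GiB."""
    return rows * cols * bytes_per_value / 1024**3

embeddings_gib = dense_matrix_gib(100_000, 768)       # one embedding matrix
similarity_gib = dense_matrix_gib(100_000, 100_000)   # full pairwise similarity matrix

print(f"embeddings: {embeddings_gib:.2f} GiB")
print(f"pairwise similarities: {similarity_gib:.2f} GiB")
```

Under these assumptions the embeddings take about 0.29 GiB, while the full pairwise similarity matrix needs about 37 GiB.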

**To summarize, the challenges of performing cosine similarity on large datasets are:**
- Memory: Calculating cosine similarity between all pairs can produce huge matrices and thus requires a large amount of memory.
- Speed: Computing cosine similarity over many high-dimensional vectors is slow, which makes it unsuitable for low-latency use cases.
- Accuracy: The quality of the cosine similarity ranking can be affected by the number of dimensions in the vectors. With a very large number of dimensions, similarity scores can become less discriminative.

**How to solve this problem?**
- There are several ways to overcome the challenges of calculating cosine similarity on large datasets. One way is to use a distributed computing framework, such as Hadoop or Spark. These frameworks can be used to distribute the calculation of the dot product and the product of the lengths of the vectors across multiple machines. This can significantly reduce the time it takes to calculate cosine similarity on large datasets.
- You can also use services such as Elasticsearch and OpenSearch to speed up semantic search.
- Another way to overcome the challenges of calculating cosine similarity on large datasets is to use a vector approximation technique. Vector approximation techniques can be used to reduce the size of the vectors without significantly affecting the accuracy of the cosine similarity calculation. This can make it possible to calculate cosine similarity on large datasets in a reasonable amount of time.
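The text does not name a specific vector approximation technique. As one possible illustration, the sketch below uses a random Gaussian projection to shrink 768-dimensional vectors to 128 dimensions and compares the cosine similarity of one pair before and after. The random data, the target dimension, and the helper `cosine` are assumptions made for the example, not part of the application built later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 1,000 random 768-dimensional vectors
vectors = rng.normal(size=(1000, 768)).astype(np.float32)
# Make the second vector similar to the first so the pair has a high similarity
vectors[1] = vectors[0] + 0.5 * vectors[1]

# Random Gaussian projection from 768 down to 128 dimensions
target_dim = 128
projection = rng.normal(size=(768, target_dim)).astype(np.float32) / np.sqrt(target_dim)
reduced = vectors @ projection

def cosine(a, b):
    """Cosine similarity between two 1D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the similarity of one pair before and after the projection
exact = cosine(vectors[0], vectors[1])
approx = cosine(reduced[0], reduced[1])
print(reduced.shape, exact, approx)
```

The reduced vectors are six times smaller, and the similarity is only approximately preserved. How much error is acceptable depends on the use case.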

### Current Application

- Here we will use a combination of BM25 and Cosine Similarity to speed up the Semantic Search based on the following functional architecture.

![image.png](attachment:image.png)

**Understanding the above architecture:**
- A user sends a query to search a corpus. BM25 ranks the corpus by relevance score. The top 50 documents are then re-ranked by the cosine similarity between their embeddings and the query embedding. This leverages BM25 for speed and cosine similarity for accuracy.
- BM25 is a TF-IDF-based relevance algorithm that retrieves the hits from a corpus that best match the user query. BM25 is very fast, but the downside is that it doesn’t understand the semantics of natural language. When we combine BM25 with cosine similarity, we get the best of both worlds.
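To make the BM25 stage more concrete, here is a minimal plain-Python sketch of a basic Okapi BM25 scoring formula. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration. The notebook itself relies on `BM25Okapi` from `rank_bm25`, whose internals may differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency (always positive)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

docs = [
    ["england", "footbal", "club"],
    ["popular", "sport", "india"],
    ["invest", "share", "market"],
]
scores = bm25_scores(["england", "sport"], docs)
print(scores)
```

Documents that share a token with the query get a positive score, and documents with no overlap score zero. This is why BM25 alone cannot match a query to a document that expresses the same idea in different words.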

In [4]:
!pip install sentence-transformers
!pip install datasets
!pip install rank-bm25



### Semantic Search Algorithm

In [48]:
import numpy as np

# to import data
import datasets

# data preprocessing
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# generate embeddings
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from tqdm import tqdm

# BM25
from rank_bm25 import BM25Okapi

# exporting trained models
import joblib
import h5py
import json

In [7]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

#### Loading Datasets

In [8]:
quora = datasets.load_dataset('quora', split = 'train[:50000]')

In [9]:
type(quora)

datasets.arrow_dataset.Dataset

In [10]:
len(quora)

50000

In [43]:
quora[0]

{'questions': {'id': [1, 2],
  'text': ['What is the step by step guide to invest in share market in india?',
   'What is the step by step guide to invest in share market?']},
 'is_duplicate': False}

In [44]:
quora[1]

{'questions': {'id': [3, 4],
  'text': ['What is the story of Kohinoor (Koh-i-Noor) Diamond?',
   'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?']},
 'is_duplicate': False}

In [12]:
for i in range(10):
    print(quora[i])

{'questions': {'id': [1, 2], 'text': ['What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?']}, 'is_duplicate': False}
{'questions': {'id': [3, 4], 'text': ['What is the story of Kohinoor (Koh-i-Noor) Diamond?', 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?']}, 'is_duplicate': False}
{'questions': {'id': [5, 6], 'text': ['How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?']}, 'is_duplicate': False}
{'questions': {'id': [7, 8], 'text': ['Why am I mentally very lonely? How can I solve it?', 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?']}, 'is_duplicate': False}
{'questions': {'id': [9, 10], 'text': ['Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?', 'Which fish would survive in salt water?']}, 'is_duplicate': False}
{'questions': {'id':

In [13]:
corpus = []

for dataset in quora:
    corpus.append(dataset['questions']['text'][0])
    corpus.append(dataset['questions']['text'][1])

print(len(corpus))

corpus[:10]

100000


['What is the step by step guide to invest in share market in india?',
 'What is the step by step guide to invest in share market?',
 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
 'How can I increase the speed of my internet connection while using a VPN?',
 'How can Internet speed be increased by hacking through DNS?',
 'Why am I mentally very lonely? How can I solve it?',
 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',
 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?',
 'Which fish would survive in salt water?']

- Each dictionary object holds two questions
- So, from 50000 dictionaries we get 100000 questions
- This gives us a corpus of size 100000

#### Data Preprocessing

**We will use a custom character filtering function**
- The function replaces hyphens with spaces, keeps only the substrings that match the custom regex (words and percentage values), and returns the cleaned string

In [14]:
def char_filter(string, reg=r"[a-zA-Z'-]+|[0-9]+%|[0-9]+\.[0-9]+%"):
    # replace hyphens with spaces so hyphenated words are split apart
    string = string.replace("-", " ")
    # keep only words and percentage values (e.g. 12% or 12.5%)
    return " ".join(re.findall(reg, string))

In [15]:
print(char_filter("How are you doing?"))
print(char_filter("How-are-you-doing?"))
print(char_filter("How-are        you-doing?"))
print(char_filter("How#1 are@2 you! doing? #2"))
print(char_filter("How~ are123 you#123 doing? #Part-5"))
print(char_filter("How~ are123 sdadsad you#123 doing?&%^(9)=+$#"))

How are you doing
How are you doing
How are you doing
How are you doing
How are you doing Part
How are sdadsad you doing


**Custom function to preprocess the strings**
- lower case -> tokenize -> remove stop words -> stem tokens -> join the tokens into a single string again -> char filter

In [16]:
# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
PORTER_STEMMER = PorterStemmer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Stem the tokens
    tokens = [PORTER_STEMMER.stem(token) for token in tokens]

    # Join the tokens back into a single string
    preprocessed_text = ' '.join(tokens)

    # Drop punctuation and other unwanted characters
    preprocessed_text = char_filter(preprocessed_text)

    return preprocessed_text

In [17]:
corpus_clean = [preprocess_text(x) for x in corpus]
corpus_clean[:5]

['step step guid invest share market india',
 'step step guid invest share market',
 'stori kohinoor koh i noor diamond',
 'would happen indian govern stole kohinoor koh i noor diamond back',
 'increas speed internet connect use vpn']

#### Generate Embeddings

In [18]:
print("Loading the embeddings generation model")

mpnet = SentenceTransformer('all-mpnet-base-v2')

Loading the embeddings generation model


**Testing the SentenceTransformer**
- `mpnet.encode()` method takes a text input and returns the corresponding embedding vector.
- The embedding vector represents the semantic content of the input sentence in a numerical form.

In [19]:
print(len(mpnet.encode("Hello World")))
print(type(mpnet.encode("Hello World")))
print(mpnet.encode("Hello World").shape)

768
<class 'numpy.ndarray'>
(768,)


**Vector length is independent of the string length**

In [20]:
print(len(mpnet.encode("Hello World")))
print(len(mpnet.encode("Hello World I am Subhankar")))

768
768


**Upper- and lower-case versions of a string produce the same vector**

In [21]:
print(mpnet.encode("Hello World I am Subhankar")[50:100])
print(mpnet.encode("hello world i am subhankar")[50:100])

[-0.00374531 -0.00696284 -0.01749909 -0.00627579  0.00495815 -0.0121107
  0.02140866  0.06013618 -0.0077163  -0.03774397 -0.02784925 -0.04040908
 -0.00811867 -0.03762842  0.06962299  0.05963692  0.01305876  0.01745396
  0.01069919  0.04462854 -0.02044875 -0.05697909  0.01831596  0.05865223
 -0.02623429 -0.04469025 -0.0334723   0.02405169  0.02684478  0.00775201
  0.0650022   0.05600469 -0.01791828 -0.02227903  0.02123912  0.06487815
  0.01856707 -0.02198946 -0.05869648 -0.02880775 -0.04510571 -0.05576394
 -0.03510349 -0.01722613 -0.02693848 -0.01929936  0.03381964 -0.07317714
 -0.03803238 -0.01061553]
[-0.00374531 -0.00696284 -0.01749909 -0.00627579  0.00495815 -0.0121107
  0.02140866  0.06013618 -0.0077163  -0.03774397 -0.02784925 -0.04040908
 -0.00811867 -0.03762842  0.06962299  0.05963692  0.01305876  0.01745396
  0.01069919  0.04462854 -0.02044875 -0.05697909  0.01831596  0.05865223
 -0.02623429 -0.04469025 -0.0334723   0.02405169  0.02684478  0.00775201
  0.0650022   0.05600469 -0

**Different strings produce different vectors, as expected**

In [22]:
print(mpnet.encode("Hello World")[50:100])
print(mpnet.encode("Hello World I am Subhankar")[50:100])

[ 0.00688587 -0.00683096 -0.04876128 -0.02701077  0.01549111  0.03731693
  0.02727933  0.02649895 -0.00169232 -0.02882237  0.02566293 -0.00466161
 -0.02706406 -0.00609543  0.01816671  0.04138831 -0.03703111  0.00554035
  0.01472596  0.06878932  0.02364845 -0.02468448  0.01029455  0.08591811
  0.00808097 -0.08048777 -0.03583585  0.07399857  0.01427154  0.02014245
 -0.01492115  0.02845246  0.00244247  0.00562901 -0.03743974  0.08276888
  0.01304289 -0.02159669  0.00958259 -0.00575337 -0.03633505 -0.00155772
  0.00127762  0.02248829  0.0001466  -0.03706812  0.0059373   0.0148295
  0.02463037 -0.09740084]
[-0.00374531 -0.00696284 -0.01749909 -0.00627579  0.00495815 -0.0121107
  0.02140866  0.06013618 -0.0077163  -0.03774397 -0.02784925 -0.04040908
 -0.00811867 -0.03762842  0.06962299  0.05963692  0.01305876  0.01745396
  0.01069919  0.04462854 -0.02044875 -0.05697909  0.01831596  0.05865223
 -0.02623429 -0.04469025 -0.0334723   0.02405169  0.02684478  0.00775201
  0.0650022   0.05600469 -0

**`normalize` does not accept a 1D array, so we wrap the vector in a list explicitly**

In [23]:
print(normalize([mpnet.encode("Hello World")]).shape)

(1, 768)


**We want to capture the normalized data from the first row itself**
- The normalized data has 1 row and 768 columns

In [24]:
print(normalize([mpnet.encode("Hello World")])[0].shape)

(768,)


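**Normalized vs non-normalized: they look the same**
- This suggests that the model already returns (approximately) unit-length embeddings, so the extra normalization changes very little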

In [25]:
print(normalize([mpnet.encode("Hello World")])[0][50:100])
print(mpnet.encode("Hello World")[50:100])

[ 0.00688587 -0.00683096 -0.04876128 -0.02701077  0.01549111  0.03731693
  0.02727933  0.02649895 -0.00169232 -0.02882237  0.02566294 -0.00466161
 -0.02706406 -0.00609543  0.01816671  0.04138831 -0.03703112  0.00554035
  0.01472596  0.06878932  0.02364845 -0.02468448  0.01029455  0.08591811
  0.00808097 -0.08048778 -0.03583586  0.07399857  0.01427154  0.02014245
 -0.01492115  0.02845246  0.00244247  0.00562901 -0.03743975  0.08276888
  0.01304289 -0.02159669  0.00958259 -0.00575337 -0.03633506 -0.00155772
  0.00127762  0.02248829  0.0001466  -0.03706812  0.0059373   0.0148295
  0.02463037 -0.09740084]
[ 0.00688587 -0.00683096 -0.04876128 -0.02701077  0.01549111  0.03731693
  0.02727933  0.02649895 -0.00169232 -0.02882237  0.02566293 -0.00466161
 -0.02706406 -0.00609543  0.01816671  0.04138831 -0.03703111  0.00554035
  0.01472596  0.06878932  0.02364845 -0.02468448  0.01029455  0.08591811
  0.00808097 -0.08048777 -0.03583585  0.07399857  0.01427154  0.02014245
 -0.01492115  0.02845246  

**Compact version of the code**

In [26]:
vectors = []

for sentence in tqdm(corpus[:6]):
    vectors.append(normalize([(mpnet.encode(sentence.lower()))])[0])

print(vectors[0][:20])
print(type(vectors[0][:20]))

100%|██████████| 6/6 [00:00<00:00,  8.30it/s]

[ 0.01481656 -0.06528713 -0.03090182 -0.08895141  0.01557142 -0.013979
 -0.01647377  0.01579142  0.00844368  0.02284859  0.00886264  0.01661865
 -0.01031052  0.08167528  0.00267152  0.00580356  0.01398365 -0.02244049
 -0.08187897  0.01720893]
<class 'numpy.ndarray'>





**More readable version**

In [27]:
vectors = []

for sentence in tqdm(corpus[:6]):
    vector = mpnet.encode(sentence.lower())
    normalized_vector = normalize([vector])[0]
    vectors.append(normalized_vector)

vectors[0][:20]

100%|██████████| 6/6 [00:00<00:00,  8.39it/s]


array([ 0.01481656, -0.06528713, -0.03090182, -0.08895141,  0.01557142,
       -0.013979  , -0.01647377,  0.01579142,  0.00844368,  0.02284859,
        0.00886264,  0.01661865, -0.01031052,  0.08167528,  0.00267152,
        0.00580356,  0.01398365, -0.02244049, -0.08187897,  0.01720893])

**Doing the same for the whole corpus**

In [28]:
vectors = []

for sentence in tqdm(corpus):
    vector = mpnet.encode(sentence.lower())
    normalized_vector = normalize([vector])[0]
    vectors.append(normalized_vector)

100%|██████████| 100000/100000 [3:16:44<00:00,  8.47it/s]
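The loop above encodes one sentence per call, which took over three hours here. A possible alternative is to hand the whole list to `encode` and let it batch internally. The sketch below assumes the installed `sentence-transformers` version supports the `batch_size`, `show_progress_bar` and `normalize_embeddings` arguments. The helper name `embed_corpus` and the batch size of 64 are hypothetical, and no timing was measured for it.

```python
def embed_corpus(model, sentences, batch_size=64):
    """Encode all sentences in batches and return L2-normalized embeddings.

    `model` is expected to be a SentenceTransformer instance (e.g. `mpnet`).
    """
    lowered = [sentence.lower() for sentence in sentences]
    return model.encode(
        lowered,
        batch_size=batch_size,       # sentences per forward pass (assumed value)
        show_progress_bar=True,      # progress bar instead of wrapping the loop in tqdm
        normalize_embeddings=True,   # unit-length vectors, replacing the sklearn normalize step
    )

# Usage in the notebook would look like:
# vectors = embed_corpus(mpnet, corpus)
```

With a real model this returns a single 2D array rather than a list of 1D arrays. Indexing with `vectors[idx]` still works the same way in the search functions.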


#### BM25 - Fit BM25 to corpus

In [29]:
#we need to tokenize the corpus because bm25 accepts only tokenized input
tokenized_corpus = [doc.split(" ") for doc in corpus_clean]

print(tokenized_corpus[:6])

[['step', 'step', 'guid', 'invest', 'share', 'market', 'india'], ['step', 'step', 'guid', 'invest', 'share', 'market'], ['stori', 'kohinoor', 'koh', 'i', 'noor', 'diamond'], ['would', 'happen', 'indian', 'govern', 'stole', 'kohinoor', 'koh', 'i', 'noor', 'diamond', 'back'], ['increas', 'speed', 'internet', 'connect', 'use', 'vpn'], ['internet', 'speed', 'increas', 'hack', 'dn']]


In [30]:
bm25 = BM25Okapi(tokenized_corpus)

#### Setting up the BM25 and Semantic Search Functions:

In [31]:
def get_bm25_search_hits(query, chunks, vectors, n = 50):
    # filtering, preprocessing and tokenizing the query string
    tokenized_query = preprocess_text(query).split()

    # indices of the n documents with the highest BM25 scores, in descending order
    # (uses the global `bm25` model fitted above)
    top_n = np.argsort(bm25.get_scores(tokenized_query), axis=0)[::-1][:n]

    # documents and their corresponding embedding vectors, in BM25 rank order
    bm25_search = [chunks[idx] for idx in top_n]
    focussed_vectors = [vectors[idx] for idx in top_n]
    return bm25_search, focussed_vectors
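
Sorting all BM25 scores with `np.argsort` works at this corpus size. For larger corpora, one option is `np.argpartition`, which selects the top `n` without fully sorting every score. The sketch below is an illustration only: the helper name `top_n_indices` is hypothetical, and random numbers stand in for the output of `bm25.get_scores`.

```python
import numpy as np

def top_n_indices(scores, n=50):
    """Indices of the n highest scores, ordered from highest to lowest."""
    scores = np.asarray(scores)
    n = min(n, len(scores))
    if n == 0:
        return np.array([], dtype=int)
    # Partition so that the n largest scores occupy the last n positions (unordered)
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates by score, descending
    return candidates[np.argsort(scores[candidates])[::-1]]

rng = np.random.default_rng(0)
scores = rng.random(100_000)   # stand-in for bm25.get_scores(tokenized_query)
top = top_n_indices(scores, n=50)
print(top[:5], scores[top[:5]])
```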

**Dot product is same as Cosine Similarity if vectors are normalized**
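
A quick numerical check of this statement, using random vectors as stand-ins for real embeddings (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Cosine similarity from the definition
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_unit_vectors = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_of_unit_vectors))
```

Because the stored vectors are normalized once up front, each query only needs a dot product, with no norms to recompute.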

In [32]:
def semantic_search(query, bm25_search, focussed_vectors, top_hits = 5):
    # Encode the (lowercased) query and normalize the embedding
    query_embedding = mpnet.encode(query.lower())
    query_embedding = normalize([query_embedding])[0]

    # Dot product of normalized vectors = cosine similarity with each candidate
    similarity_scores = np.dot(focussed_vectors, query_embedding)

    # Rank the candidates by similarity score, highest first (getting the indices)
    ranked_documents = np.argsort(similarity_scores, axis=0)[::-1]

    # Print the top_hits results (fewer if there are not enough candidates)
    for idx in ranked_documents[:top_hits]:
        print(bm25_search[idx])
        print("Similarity Score:", similarity_scores[idx])
        print("-"*50)

#### Only BM25 vs (BM25+Embeddings) or (BM25+Semantic)

In [37]:
query = "England Sports"

In [38]:
bm25_search, focussed_vectors = get_bm25_search_hits(query = query, chunks = corpus, vectors=vectors, n = 50)

In [39]:
print("Semantic Search\n")
semantic_search(query, bm25_search, focussed_vectors, top_hits=10)
print("\n")

Semantic Search

How many football clubs are there in England ?
Similarity Score: 0.5842164634309986
--------------------------------------------------
What are the most popular sports in India?
Similarity Score: 0.5247922628896764
--------------------------------------------------
Who will win the England v Wales match?
Similarity Score: 0.52393072015134
--------------------------------------------------
Why is Australia good in all sports?
Similarity Score: 0.5059840291385085
--------------------------------------------------
Which is the most popular sport in Europe and why?
Similarity Score: 0.5054940585686407
--------------------------------------------------
England v. Wales - who will win?
Similarity Score: 0.4983591279861246
--------------------------------------------------
Why is Australia so good at sports?
Similarity Score: 0.49700631794973055
--------------------------------------------------
What is the weirdest sport?
Similarity Score: 0.46312886927304486
---------------

In [40]:
print("BM25 Search")
bm25_search[:10]

BM25 Search


['Why is the Queen of England the Queen of England?',
 'Is England in Britain? What is the difference between England, Britain and the UK?',
 'What is Londonderry, England known for?',
 'What is your favorite sport?',
 'Why is NASCAR a sport?',
 'What is the weirdest sport?',
 'What are your favorite sports?',
 'What are the weirdest sports?',
 'Can sports be arts?',
 'How is fishing a sport?']

#### Exporting Models

In [56]:
quora_dict = {
    "questions" : corpus
}

In [57]:
quora_dict["questions"][:5]

['What is the step by step guide to invest in share market in india?',
 'What is the step by step guide to invest in share market?',
 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
 'How can I increase the speed of my internet connection while using a VPN?']

In [59]:
# exporting the dataset

with open("quora_questions_dataset.json", "w") as file:
    json.dump(quora_dict, file, indent=4)

In [45]:
joblib.dump(bm25, 'bm25_model.joblib')

['bm25_model.joblib']

In [46]:
# Save the array to an HDF5 file
with h5py.File('embedding_vector.h5', 'w') as hf:
    hf.create_dataset('data', data=vectors)

#### Downloading the embedding_vector file

In [60]:
from google.colab import files

files.download("embedding_vector.h5")
