## 1. Introduction to Semantic Search

Semantic search is an information retrieval technique that tries to interpret the intent and contextual meaning behind a user's query, rather than simply matching keywords. Compared with traditional keyword-based search, it can return relevant results even when the query and the documents use different words.

### 1.2 Semantic Search vs. Traditional Keyword-Based Search

#### Traditional Keyword-Based Search:
- Matches documents containing the exact keywords from the query.
- Often requires users to know the right keywords to find relevant information.
- Can miss relevant documents that use synonyms or related concepts.
- May return irrelevant results if keywords are used in different contexts.

#### Semantic Search:
- Understands the meaning behind the query and finds documents that are semantically related.
- Allows users to use natural language queries.
- Can find relevant documents even if they don't contain the exact query terms.
- Considers the context of the search and the user's intent.
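
As a minimal sketch of the synonym problem listed above, the snippet below implements exact keyword matching over a toy corpus. The documents, the query, and the helper name `keyword_search` are illustrative assumptions and do not describe any particular search engine.

```python
def keyword_search(query, documents):
    """Return documents that contain every query term as an exact token."""
    query_terms = set(query.lower().split())
    hits = []
    for doc in documents:
        doc_terms = set(doc.lower().replace(".", "").split())
        if query_terms <= doc_terms:  # all query terms must be present
            hits.append(doc)
    return hits

# Hypothetical toy corpus: the first two documents cover the same topic.
docs = [
    "Tips for buying a used car.",
    "How to choose a second-hand automobile.",
    "A recipe for apple pie.",
]

car_hits = keyword_search("car", docs)
print(car_hits)  # only the document containing the literal token "car"
```

The second document is about the same subject but is never returned for the query "car", which is the gap a semantic approach tries to close.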

### Example:
Consider a user searching for "apple". 

A traditional search might return results about:
- The fruit
- Apple Inc. (the tech company)
- Apple Records (the record label)
- Recipes containing apples
- News articles mentioning "apple" in various contexts

In contrast, a semantic search would attempt to understand the context:
- If the user has recently been browsing tech websites, it might prioritize results about Apple Inc.
- If the query is "how to cook an apple pie", it would understand that the apple in question is the fruit.
- If the query is "apple stock price", it would interpret this as relating to the company's financial information.

## 2. Key Concepts in Semantic Search

### 2.1 Vector Space Model (VSM)

The Vector Space Model is a fundamental concept in information retrieval and forms the basis for many semantic search techniques. In VSM, we represent text data (words, documents, queries) as vectors in a high-dimensional space. This mathematical representation allows us to perform various operations and comparisons on text data.

Key points about VSM:

- Each dimension in the vector space typically corresponds to a term in the vocabulary.
- Documents and queries are represented as vectors in this space.
- The similarity between documents, or between a query and a document, can be measured by the proximity of their vector representations.

Mathematically, we can represent this as follows:

Let $D = \{d_1, d_2, ..., d_n\}$ be a set of documents and $T = \{t_1, t_2, ..., t_m\}$ be the set of terms in the vocabulary.

A document $d_i$ is represented as a vector:

$$d_i = (w_{i1}, w_{i2}, ..., w_{im})$$

where $w_{ij}$ is the weight of term $t_j$ in document $d_i$.

The VSM allows us to translate the problem of determining the relevance of a document to a query into a problem of measuring the similarity between vectors. This forms the foundation for many more advanced techniques in semantic search.
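
The representation above can be sketched in a few lines of plain Python. This is a simplified illustration that uses raw term counts as the weights $w_{ij}$; the corpus and the helper names `build_vocabulary` and `to_vector` are assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(texts):
    """Sorted list of distinct terms; each term becomes one dimension."""
    return sorted({term for text in texts for term in text.lower().split()})

def to_vector(text, vocabulary):
    """Represent text as (w_1, ..., w_m), with w_j = raw count of term t_j."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical three-document corpus.
docs = ["apple pie recipe", "apple stock price", "stock market news"]
vocab = build_vocabulary(docs)
doc_vectors = [to_vector(d, vocab) for d in docs]
query_vector = to_vector("apple stock", vocab)

print(vocab)
for vec in doc_vectors:
    print(vec)
print(query_vector)
```

Documents and the query end up in the same space, so any vector similarity measure can compare them.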

### 2.2 Embeddings and Representation of Text

Embeddings are dense vector representations of words, phrases, or entire documents that capture semantic information. The key idea behind embeddings is that words or documents with similar meanings should have similar vector representations.

Characteristics of embeddings:

- They are typically low-dimensional (compared to the size of the vocabulary), dense vectors.
- Similar meanings result in vectors that are close to each other in the embedding space.
- They can capture complex relationships between words, such as analogies.

For example, in a well-trained word embedding space:
- The vector for "king" - "man" + "woman" should be close to the vector for "queen".
- The vector for "Paris" - "France" + "Italy" should be close to the vector for "Rome".

These embeddings form the backbone of many modern semantic search systems, allowing for more nuanced understanding and comparison of text data.
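
The analogy arithmetic above can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (one axis loosely standing for "royalty", the other for "gender"); learned embeddings have far more dimensions and only approximately satisfy such relations.

```python
import math

# Hand-made 2-D vectors, NOT learned embeddings.
toy_embeddings = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],  # unrelated distractor word
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

result = analogy("king", "man", "woman", toy_embeddings)
print(result)
```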

## 3. Mathematical Principles of Semantic Search

### 3.1 Vector Representations

As mentioned earlier, in semantic search, we represent words, phrases, or entire documents as vectors in a high-dimensional space. This is a crucial step that allows us to apply mathematical operations to text data.

Formally, each document or query is represented as a vector $\mathbf{v} \in \mathbb{R}^n$, where $n$ is the dimensionality of the embedding space.

The choice of how to create these vector representations is a key factor in the performance of a semantic search system. Various methods exist, from simple techniques like bag-of-words to more advanced approaches like word embeddings and contextual embeddings.

### 3.2 Similarity Metrics

Once we have vector representations, we need ways to measure how similar or different these vectors are. This is crucial for determining which documents are most relevant to a given query. Several similarity metrics are commonly used in semantic search:

#### 3.2.1 Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. It's widely used because it's efficient to calculate and is not affected by the magnitude of the vectors (only their direction matters).

$$\text{cosine\_similarity}(\mathbf{v_1}, \mathbf{v_2}) = \frac{\mathbf{v_1} \cdot \mathbf{v_2}}{\|\mathbf{v_1}\| \|\mathbf{v_2}\|}$$

where $\cdot$ denotes the dot product and $\|\mathbf{v}\|$ is the Euclidean norm of vector $\mathbf{v}$.

#### 3.2.2 Euclidean Distance

Euclidean distance measures the straight-line distance between two points in Euclidean space. In the context of semantic search, a smaller distance indicates greater similarity.

$$\text{euclidean\_distance}(\mathbf{v_1}, \mathbf{v_2}) = \|\mathbf{v_1} - \mathbf{v_2}\| = \sqrt{\sum_{i=1}^n (v_{1i} - v_{2i})^2}$$

#### 3.2.3 Dot Product

The dot product is another simple and efficient way to measure similarity. Unlike cosine similarity, it is sensitive to vector magnitude; for unit-length (L2-normalized) vectors, the two are equal.

$$\text{dot\_product}(\mathbf{v_1}, \mathbf{v_2}) = \mathbf{v_1} \cdot \mathbf{v_2} = \sum_{i=1}^n v_{1i} v_{2i}$$
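
The three metrics above translate directly into code. The following is a minimal pure-Python sketch; the function names and the sample vectors are chosen for illustration only.

```python
import math

def dot_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def norm(v):
    """Euclidean norm of v."""
    return math.sqrt(dot_product(v, v))

def cosine_sim(v1, v2):
    return dot_product(v1, v2) / (norm(v1) * norm(v2))

def euclidean_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def normalize(v):
    """Scale v to unit length."""
    n = norm(v)
    return [a / n for a in v]

v1, v2 = [3.0, 4.0], [4.0, 3.0]
cos = cosine_sim(v1, v2)                    # 24 / (5 * 5)
dist = euclidean_distance(v1, v2)           # sqrt(1 + 1)
dot_unit = dot_product(normalize(v1), normalize(v2))

print(cos, dist, dot_unit)
```

Scaling `v1` by any positive factor leaves `cos` unchanged but changes both `dist` and the raw dot product, which is why cosine similarity is a common default when vector lengths carry no meaning.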

### 3.3 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It's based on the idea that:
- Words that appear frequently in a document are important to that document (term frequency).
- But words that appear frequently across many documents are less discriminative (inverse document frequency).

Mathematically, for a term $t$ in a document $d$:

$$TF(t,d) = \frac{\text{Number of times } t \text{ appears in } d}{\text{Total number of terms in } d}$$

$$IDF(t) = \log\frac{\text{Total number of documents}}{\text{Number of documents containing } t}$$

$$TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)$$

TF-IDF can be used to create document vectors, where each dimension corresponds to a term's TF-IDF score. This method provides a good baseline for many information retrieval tasks and is still relevant in semantic search, often used in combination with more advanced techniques.
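
The formulas above can be checked by hand on a tiny corpus. The sketch below implements them literally, using the natural logarithm and no smoothing; the corpus and function names are assumptions made for the example.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the apple is a fruit".split(),
    "the apple pie recipe".split(),
    "the stock price rose".split(),
]

score_the = tf_idf("the", corpus[0], corpus)      # in every document: log(3/3) = 0
score_apple = tf_idf("apple", corpus[0], corpus)  # (1/5) * log(3/2)
score_fruit = tf_idf("fruit", corpus[0], corpus)  # (1/5) * log(3/1)

print(score_the, score_apple, score_fruit)
```

Library implementations often differ in detail. scikit-learn's `TfidfVectorizer`, for instance, applies IDF smoothing and vector normalization by default, so its scores will not match these numbers.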

### 3.4 Word Embeddings

Word embeddings are dense vector representations of words that capture semantic meaning.

#### Mathematical Formulation:

Let $V$ be the vocabulary, $|V|$ its size, and $d$ the embedding dimension. The embedding matrix $E \in \mathbb{R}^{|V| \times d}$ contains a $d$-dimensional vector for each word in the vocabulary.

For a word $w$, its embedding $e_w$ is:

$$e_w = E[w] \in \mathbb{R}^d$$

Word2Vec, a widely used method for learning such embeddings, has two main architectures:

1. Continuous Bag of Words (CBOW): Predicts a target word given its context.
2. Skip-gram: Predicts the context given a target word.

For the Skip-gram model, the objective is to maximize:

$$\frac{1}{T} \sum_{t=1}^T \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j}|w_t)$$

where $c$ is the size of the context window, $T$ is the number of words in the corpus, and $p(w_{t+j}|w_t)$ is modeled using the softmax function:

$$p(w_O|w_I) = \frac{\exp(e_{w_O}^T e_{w_I})}{\sum_{w' \in V} \exp(e_{w'}^T e_{w_I})}$$

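Here, $w_I$ is the input word and $w_O$ is the output (context) word. The original Word2Vec formulation uses separate input and output vectors for each word; the expression above uses a single embedding matrix $E$ for simplicity.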
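
To make the softmax concrete, the sketch below evaluates $p(w_O|w_I)$ over a four-word vocabulary. The embedding values are invented, and a single vector per word is assumed, matching the simplified formula above rather than a full Word2Vec implementation.

```python
import math

# Hypothetical 2-D embeddings for a four-word vocabulary.
embeddings = {
    "apple": [1.0, 0.2],
    "fruit": [0.9, 0.1],
    "stock": [-0.8, 0.5],
    "price": [-0.7, 0.6],
}

def context_probabilities(input_word, embeddings):
    """p(w_O | w_I) for every w_O in the vocabulary: softmax over dot products."""
    e_in = embeddings[input_word]
    scores = {w: sum(a * b for a, b in zip(e, e_in)) for w, e in embeddings.items()}
    max_score = max(scores.values())  # subtracting the max keeps exp() numerically stable
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}

probs = context_probabilities("apple", embeddings)
for word, p in probs.items():
    print(f"p({word} | apple) = {p:.3f}")
```

The denominator sums over the whole vocabulary, which is expensive for realistic vocabulary sizes; this is why practical Word2Vec training relies on approximations such as negative sampling or hierarchical softmax.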

### 3.5 Document Embeddings

Document embeddings extend the idea of word embeddings to entire documents. A simple approach is to average the word embeddings of all words in the document:

$$d = \frac{1}{N} \sum_{i=1}^N e_{w_i}$$

where $N$ is the number of words in the document and $e_{w_i}$ is the embedding of the $i$-th word.
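
The averaging formula can be sketched as follows. The word vectors are invented for the example, and the sketch makes one assumption beyond the formula: words missing from the vocabulary are skipped, so $N$ counts only the known words.

```python
def average_embedding(tokens, embeddings):
    """Mean of the embeddings of the tokens that appear in the vocabulary."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known words; the caller decides how to handle this
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-D word vectors.
word_vectors = {
    "apple": [1.0, 0.0],
    "pie":   [0.0, 1.0],
    "sweet": [1.0, 1.0],
}

doc_vec = average_embedding("sweet apple pie".split(), word_vectors)
print(doc_vec)
```

Averaging discards word order, so "dog bites man" and "man bites dog" receive the same document vector.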

More sophisticated models like Doc2Vec learn document embeddings directly, similar to Word2Vec but with an additional document vector.

### 3.6 Latent Semantic Analysis (LSA)

LSA is a technique that uncovers the underlying semantic structure in a term-document matrix using Singular Value Decomposition (SVD). It's based on the idea that words used in similar contexts tend to have similar meanings.

Mathematically, LSA applies SVD to the term-document matrix:

$$X = U\Sigma V^T$$

where:
- $X$ is the term-document matrix
- $U$ and $V$ are orthogonal matrices whose columns are the left (term) and right (document) singular vectors
- $\Sigma$ is a diagonal matrix of singular values

By keeping only the $k$ largest singular values, we obtain a low-rank approximation that captures the most important semantic relationships. This reduces noise and reveals the latent semantic structure of the documents.
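
The truncation step can be written out with the same symbols. The subscripted names $U_k$, $\Sigma_k$, $V_k$ and the projected query $\hat{q}$ are notation introduced here for illustration:

$$X_k = U_k \Sigma_k V_k^T$$

where $U_k$ and $V_k$ contain the first $k$ columns of $U$ and $V$, and $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$. A query expressed as a term vector $q$ is commonly mapped into the same $k$-dimensional space with

$$\hat{q} = \Sigma_k^{-1} U_k^T q$$

so that it can be compared with the rows of $V_k$ (one per document), for example using cosine similarity.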

## 4. Models Used in Semantic Search

Semantic search leverages various models to understand and process text data. These models range from traditional statistical approaches to advanced neural network architectures. Understanding these models is crucial for grasping how semantic search systems work and how they've evolved over time.

### 4.1 TF-IDF Model

While not a deep learning technique, TF-IDF (Term Frequency-Inverse Document Frequency) is still relevant in many semantic search applications and forms a good starting point for understanding more complex models.

#### How TF-IDF Works in Semantic Search:

1. **Document Representation**: Each document is represented as a vector where each element corresponds to a term in the vocabulary. The value of each element is the TF-IDF score of that term in the document.

2. **Query Processing**: The user's query is also converted into a TF-IDF vector.

3. **Similarity Calculation**: The similarity between the query vector and document vectors is calculated, often using cosine similarity.

4. **Ranking**: Documents are ranked based on their similarity to the query.

#### Advantages of TF-IDF in Semantic Search:

- Simple and computationally efficient
- Provides a good baseline for many information retrieval tasks
- Can be combined with more advanced techniques for improved performance

#### Limitations:
- Doesn't capture word order or context
- Cannot understand synonyms or related concepts that don't share exact terms

### 4.2 Word2Vec

Word2Vec is a groundbreaking model that learns vector representations of words (word embeddings) from large corpora of text.

#### How Word2Vec Works in Semantic Search:

1. **Training**: Word2Vec is trained on a large corpus of text to learn word embeddings.

2. **Document Representation**: Documents are represented by combining the embeddings of their constituent words (e.g., by averaging).

3. **Query Processing**: The query is similarly converted into a vector by combining word embeddings.

4. **Similarity Calculation**: As with TF-IDF, similarity between query and document vectors is calculated.

#### Advantages in Semantic Search:
- Captures semantic relationships between words
- Can handle synonyms and related concepts
- Enables more nuanced similarity calculations

#### Limitations:
- Static embeddings don't account for context-dependent word meanings
- Limited by the quality and diversity of the training corpus

### 4.3 BERT (Bidirectional Encoder Representations from Transformers)

BERT, introduced by Google in 2018, represents a significant leap forward in natural language processing and has had a profound impact on semantic search.

#### How BERT Works in Semantic Search:

1. **Pre-training**: BERT is pre-trained on a large corpus of text using two main tasks:
   - Masked Language Modeling (MLM): Predicting masked words in a sentence.
   - Next Sentence Prediction (NSP): Predicting if one sentence follows another.

2. **Fine-tuning**: The pre-trained BERT model is then fine-tuned on specific tasks, including semantic search.

3. **Contextualized Embeddings**: BERT generates contextualized embeddings for words and sentences, meaning the same word can have different representations depending on its context.

4. **Document and Query Representation**: Both documents and queries are passed through BERT to generate their respective embeddings.

5. **Similarity Calculation**: As with previous models, similarity between query and document embeddings is calculated to rank results.

#### Mathematical Formulation:

For an input sequence $X = [x_1, ..., x_n]$, BERT produces contextualized representations $H = [h_1, ..., h_n]$ where each $h_i \in \mathbb{R}^d$ is a $d$-dimensional vector.

The core of BERT is the self-attention mechanism:

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

where $Q$, $K$, and $V$ are query, key, and value matrices derived from the input embeddings.
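
The attention formula can be traced on a tiny input. The sketch below is a plain-Python illustration of single-head scaled dot-product attention. For brevity it assumes $Q = K = V$, all equal to the raw input, whereas an actual transformer derives them through separate learned projections.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and also return the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens, each a 2-dimensional vector (d_k = 2).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(X, X, X)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and states how strongly one token attends to every token in the sequence; each output row is the corresponding weighted average of the rows of `V`.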

#### Advantages in Semantic Search:
- Captures complex, context-dependent meanings of words and phrases
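- Can relate distant words within its input window (512 tokens for the original BERT models)
- Achieved state-of-the-art results on many NLP benchmarks when it was released, and fine-tuned variants perform strongly on semantic search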

#### Limitations:
- Computationally intensive, which can be challenging for real-time applications
- Requires significant amounts of data and computational resources for training

### 4.4 Transformer-based Models in Semantic Search

The success of BERT has led to the development of numerous other transformer-based models, each with its own strengths and applications in semantic search.

#### 4.4.1 Encoder-Only Models

Examples: BERT, RoBERTa, ALBERT

These models focus on understanding and representing input text. They are particularly effective for tasks that require deep comprehension of text, such as document classification or relevance ranking in semantic search.

#### 4.4.2 Decoder-Only Models

Examples: GPT (Generative Pre-trained Transformer), GPT-2, GPT-3

While primarily designed for text generation, decoder-only models can be adapted for semantic search tasks. They excel at tasks like query expansion or generating descriptive search snippets.

#### 4.4.3 Encoder-Decoder Models

Examples: T5 (Text-to-Text Transfer Transformer), BART

These models can be used for various subtasks in semantic search, such as query reformulation or document summarization.

### 4.5 Why Different Model Architectures Matter in Semantic Search

Understanding the differences between encoder-only, decoder-only, and encoder-decoder models is crucial in semantic search because each architecture has its strengths and is suited to different aspects of the search process:

1. **Encoder-Only Models (e.g., BERT)**: 
   - Best for: Understanding the meaning of queries and documents
   - Use in semantic search: Generating high-quality embeddings for both queries and documents, which can be directly compared for relevance ranking

2. **Decoder-Only Models (e.g., GPT)**:
   - Best for: Generating human-like text
   - Use in semantic search: Query expansion (generating alternative phrasings for a query), creating detailed search snippets, or even generating natural language explanations of search results

3. **Encoder-Decoder Models (e.g., T5)**:
   - Best for: Transforming input text into output text
   - Use in semantic search: Query reformulation, translating queries between languages in cross-lingual search, or summarizing documents for more informative search results

## 5. The Semantic Search Process

### 5.1 Indexing

Indexing is the process of preparing documents for efficient search and retrieval.

#### Steps involved:
1. **Text Preprocessing**: Cleaning and normalizing text (e.g., lowercasing, removing punctuation, stemming/lemmatization).
2. **Feature Extraction**: Converting text into numerical features (e.g., TF-IDF vectors, word embeddings).
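3. **Dimensionality Reduction**: Optionally reducing the dimensionality of feature vectors (e.g., using PCA or truncated SVD). Methods such as t-SNE are better suited to visualization, because new queries cannot easily be projected into the reduced space.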
4. **Index Structure Creation**: Organizing document vectors in a structure that allows for fast retrieval (e.g., inverted index, vector database).
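
As a minimal sketch of steps 1 and 4 above, the snippet below builds an inverted index, the structure typically used for term-based retrieval. The preprocessing is deliberately simplified (no stemming or lemmatization), and the helper names are assumptions made for the example.

```python
from collections import defaultdict

def preprocess(text):
    """Lowercase the text and replace punctuation with spaces before splitting."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = [
    "The apple is a sweet fruit.",
    "Apple designs and sells consumer electronics.",
]
index = build_inverted_index(docs)

print(index["apple"])
print(index["fruit"])
```

For dense embeddings, the analogous structure is a vector index (for example, an approximate nearest-neighbour index inside a vector database) rather than a term-to-document map.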

### 5.2 Query Processing

When a user submits a query, it undergoes processing to match the format of the indexed documents.

#### Steps involved:
1. **Query Preprocessing**: Similar to document preprocessing (cleaning, normalization).
2. **Query Expansion**: Optionally expanding the query with synonyms or related terms.
3. **Query Vectorization**: Converting the query into a vector representation.
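
Query expansion (step 2) can be sketched with a lookup table. The synonym table below is hand-written and purely hypothetical; real systems may derive related terms from a thesaurus, from embeddings, or from a language model.

```python
# Hypothetical hand-written synonym table.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the original query terms followed by any listed synonyms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for syn in synonyms.get(term, []):
            if syn not in expanded:  # avoid duplicate terms
                expanded.append(syn)
    return expanded

expanded_terms = expand_query("cheap car rental")
print(expanded_terms)
```

Expansion tends to raise recall at some cost in precision, which is one reason it is listed as optional.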

### 5.3 Retrieval

This stage involves finding documents that are potentially relevant to the query.

#### Steps involved:
1. **Initial Retrieval**: Using efficient algorithms to retrieve a subset of potentially relevant documents.
2. **Similarity Calculation**: Computing similarity scores between the query vector and document vectors.

### 5.4 Ranking

The retrieved documents are ordered based on their relevance to the query.

#### Steps involved:
1. **Score Calculation**: Computing a relevance score for each retrieved document.
2. **Ranking Algorithm**: Ordering documents based on their scores and potentially other factors.
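
The retrieval and ranking stages can be sketched together as a brute-force top-k search. The document vectors and ids below are invented for the example, and every document is scored directly, which is an assumption that only holds for small collections.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def rank(query_vec, doc_vecs, top_k=2):
    """Score every document against the query and return the top_k (id, score) pairs."""
    scored = [(doc_id, cosine_sim(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical precomputed document vectors keyed by document id.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.3, 0.0],
}

top = rank([1.0, 0.0, 0.0], doc_vecs)
for doc_id, score in top:
    print(doc_id, round(score, 4))
```

On large collections, an efficient initial retrieval step (an inverted index or an approximate nearest-neighbour search) usually selects a candidate subset first, and only those candidates are scored and ranked this way.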

### 5.5 Result Presentation

The final step involves presenting the ranked results to the user in a meaningful way.

#### Steps involved:
1. **Snippet Generation**: Creating concise summaries of each result.
2. **Result Clustering**: Optionally grouping similar results together.
3. **Diversity Promotion**: Ensuring a diverse set of results is presented.

In [1]:
import torch
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertModel, BertTokenizer, pipeline

In [10]:
documents = [
    "The apple is a sweet fruit.",
    "Apple designs and sells consumer electronics.",
    "Apple's latest software update includes new features.",
    "She picked an apple from the tree in her backyard."
]
query = "Apple's stock price increased after the product launch."

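def display_results(model_name, similarity_scores, documents):
    """Print one similarity score per document, then the highest-scoring document."""
    print(f"\n{model_name} Similarity Scores:")
    for idx, score in enumerate(similarity_scores):
        print(f"Document {idx+1}: {score:.4f} -> {documents[idx]}")
    # argmax() exists on both NumPy arrays and torch tensors; cast to a plain int for list indexing.
    most_relevant_doc_index = int(similarity_scores.argmax())
    print(f"\nMost Relevant Document: {documents[most_relevant_doc_index]}\n")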

def tfidf_model(documents, query):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_vec = vectorizer.transform([query])
    similarity_scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    display_results("TF-IDF", similarity_scores, documents)

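def word2vec_model(documents, query):
    # Caveat: four short sentences are far too little data to learn meaningful
    # embeddings, so this only demonstrates the mechanics. In practice, load
    # vectors trained on a large corpus.
    tokenized_docs = [doc.lower().split() for doc in documents]
    w2v_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

    def average_vector(text):
        # Whitespace tokenization keeps punctuation attached ("fruit." != "fruit"),
        # and words unseen during training are skipped.
        vectors = [w2v_model.wv[word] for word in text.lower().split() if word in w2v_model.wv]
        # Fall back to a zero NumPy vector so every embedding has the same type.
        return sum(vectors) / len(vectors) if vectors else torch.zeros(w2v_model.wv.vector_size).numpy()

    doc_embeddings = [average_vector(doc) for doc in documents]
    query_embedding = average_vector(query)
    similarity_scores = [cosine_similarity([query_embedding], [doc_emb]).flatten()[0] for doc_emb in doc_embeddings]
    display_results("Word2Vec", torch.tensor(similarity_scores), documents)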

def bert_model(documents, query):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    def embed_text(text):
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
        with torch.no_grad():  # inference only, so skip gradient tracking
            outputs = model(**inputs)
        # Use the final hidden state of the [CLS] token (position 0) as the text embedding.
        return outputs.last_hidden_state[:, 0, :].numpy()

    doc_embeddings = [embed_text(doc) for doc in documents]
    query_embedding = embed_text(query)
    similarity_scores = [cosine_similarity(query_embedding, doc_emb).flatten()[0] for doc_emb in doc_embeddings]
    display_results("BERT", torch.tensor(similarity_scores), documents)

def transformer_model(documents, query):
    # Named "extractor" so the local variable does not shadow this function's own name.
    extractor = pipeline('feature-extraction', model='sentence-transformers/all-MiniLM-L6-v2')

    def embed_text(text):
        # outputs[0] holds one vector per token; mean-pool them into a single embedding.
        outputs = extractor(text)
        return torch.tensor(outputs[0]).mean(dim=0).unsqueeze(0)

    doc_embeddings = [embed_text(doc) for doc in documents]
    query_embedding = embed_text(query)
    similarity_scores = [cosine_similarity(query_embedding, doc_emb).flatten()[0] for doc_emb in doc_embeddings]
    display_results("Transformer", torch.tensor(similarity_scores), documents)

In [11]:
print(query)

tfidf_model(documents, query)
word2vec_model(documents, query)
bert_model(documents, query)
transformer_model(documents, query)

Apple's stock price increased after the product launch.

TF-IDF Similarity Scores:
Document 1: 0.4791 -> The apple is a sweet fruit.
Document 2: 0.1254 -> Apple designs and sells consumer electronics.
Document 3: 0.1150 -> Apple's latest software update includes new features.
Document 4: 0.3170 -> She picked an apple from the tree in her backyard.

Most Relevant Document: The apple is a sweet fruit.


Word2Vec Similarity Scores:
Document 1: 0.2895 -> The apple is a sweet fruit.
Document 2: -0.0648 -> Apple designs and sells consumer electronics.
Document 3: 0.3926 -> Apple's latest software update includes new features.
Document 4: 0.3193 -> She picked an apple from the tree in her backyard.

Most Relevant Document: Apple's latest software update includes new features.





BERT Similarity Scores:
Document 1: 0.8255 -> The apple is a sweet fruit.
Document 2: 0.8286 -> Apple designs and sells consumer electronics.
Document 3: 0.8947 -> Apple's latest software update includes new features.
Document 4: 0.8363 -> She picked an apple from the tree in her backyard.

Most Relevant Document: Apple's latest software update includes new features.





Transformer Similarity Scores:
Document 1: 0.3928 -> The apple is a sweet fruit.
Document 2: 0.4849 -> Apple designs and sells consumer electronics.
Document 3: 0.4689 -> Apple's latest software update includes new features.
Document 4: 0.2841 -> She picked an apple from the tree in her backyard.

Most Relevant Document: Apple designs and sells consumer electronics.



