# TF-IDF

**TF-IDF** (Term Frequency-Inverse Document Frequency) is a weighting scheme that scores how important a word is to one document within a collection of documents. The resulting weights can serve as document features, for example to measure the similarity between documents.

## Concept
- **TF (Term Frequency)**: Indicates how frequently a specific word appears in a document.
  E.g., if the word "cat" appears 10 times in a document, its raw TF value is 10.
  
- **IDF (Inverse Document Frequency)**: Represents the inverse of how frequently a word appears across multiple documents. It is calculated as:

  $$
  \text{IDF} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
  $$

  Words that appear in many documents say little about any one of them, so IDF gives them a lower weight. A word that appears in every document gets an IDF of $\log(1) = 0$.
  
- **TF-IDF**: The product of TF and IDF values, which represents the importance of a specific word in a document. A high TF-IDF value indicates that the word appears often in a particular document but not in others, making it more significant.

## Application
Each document can be represented as a vector of its TF-IDF values. The similarity between two documents can then be measured with **cosine similarity**, and similar documents can be grouped with **clustering**.
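
As a rough illustration of the similarity use case, the sketch below vectorizes three short sentences with scikit-learn's `TfidfVectorizer` and compares them with `cosine_similarity`. The corpus and the variable names are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative corpus (an assumption for this sketch).
corpus = [
    "Tom plays soccer.",
    "Tom loves soccer and baseball.",
    "baseball is his hobby and his job.",
]

# Rows are documents, columns are vocabulary terms.
vectors = TfidfVectorizer().fit_transform(corpus)

# similarity[i, j] is the cosine similarity between document i and document j.
similarity = cosine_similarity(vectors)
print(similarity.round(3))
```

The first and third sentences share no words, so their similarity is zero, while the first two share "Tom" and "soccer" and therefore score higher.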

## Example
Consider three documents:

1. Tom plays soccer. (Doc1)
2. Tom loves soccer and baseball. (Doc2)
3. baseball is his hobby and his job. (Doc3)

**TF Calculation**

![Calculation of TF](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*_LxRJqXRjJNm95-R7B-xGQ.png)

**IDF Calculation**

![Calculation of IDF](https://miro.medium.com/v2/resize:fit:828/format:webp/1*AyiKGi5VfEikGdw_Tg9TdA.png)
 
For instance, the word "Tom" appears in two of the three documents (Doc1 and Doc2), so with a base-10 logarithm its IDF value is:

$$
\text{IDF} = \log\left(\frac{3}{2}\right) \approx 0.18
$$

**TF-IDF Calculation**

![Calculation of TF-IDF](https://miro.medium.com/v2/resize:fit:828/format:webp/1*zHRhIUR7XOmVSIkZbuU_iA.png)

In Doc3, the word "his" has a TF value of 2. It appears in only one of the three documents, so its IDF value is $\log_{10}(3/1) \approx 0.48$, resulting in a TF-IDF score of:

$$
\text{TF-IDF} = 2 \times 0.48 = 0.96
$$
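
The worked example can be checked with a few lines of plain Python. This is a minimal sketch that assumes raw counts for TF, a base-10 logarithm for IDF, lowercased text, and whitespace tokenization. None of the names below come from a library.

```python
import math
from collections import Counter

# The three example documents, lowercased and without the final period.
docs = {
    "Doc1": "tom plays soccer",
    "Doc2": "tom loves soccer and baseball",
    "Doc3": "baseball is his hobby and his job",
}

tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

# TF: raw count of each word in each document.
tf = {name: Counter(words) for name, words in tokens.items()}

# DF: number of documents that contain each word at least once.
df = Counter(word for words in tokens.values() for word in set(words))

# IDF with a base-10 logarithm (an assumption that matches the rounded values above).
idf = {word: math.log10(n_docs / count) for word, count in df.items()}

# TF-IDF: product of TF and IDF.
tfidf = {
    name: {word: count * idf[word] for word, count in counts.items()}
    for name, counts in tf.items()
}

print(f"IDF(tom) = {idf['tom']:.3f}")
print(f"TF-IDF(his, Doc3) = {tfidf['Doc3']['his']:.3f}")
```

Without intermediate rounding the score for "his" is about 0.954. The value 0.96 comes from rounding the IDF to 0.48 before multiplying.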

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Three short example documents.
X = [
    "The cat and the dog play in the garden",
    "The dog chases the cat around the house",
    "The house has a beautiful garden with flowers",
]

# Drop English stop words, learn the vocabulary, and compute the TF-IDF weights.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(X)
feature_names = tfidf_vectorizer.get_feature_names_out()

print("\n Word List:")
print(feature_names)

# One row per document, one column per vocabulary term.
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print("\nTF-IDF Matrix:")
print(df)

# List the non-zero weights of each document.
for i, doc in enumerate(X):
    print(f"\nDoc {i+1}: {doc}")
    for term in feature_names:
        tf_idf_val = tfidf_matrix[i, tfidf_vectorizer.vocabulary_[term]]
        if tf_idf_val > 0:
            print(f"  {term}: {tf_idf_val:.4f}")
```


```text
 Word List:
['beautiful' 'cat' 'chases' 'dog' 'flowers' 'garden' 'house' 'play']

TF-IDF Matrix:
   beautiful       cat    chases       dog   flowers    garden     house  \
0   0.000000  0.459854  0.000000  0.459854  0.000000  0.459854  0.000000
1   0.000000  0.459854  0.604652  0.459854  0.000000  0.000000  0.459854
2   0.562829  0.000000  0.000000  0.000000  0.562829  0.428046  0.428046

       play
0  0.604652
1  0.000000
2  0.000000

Doc 1: The cat and the dog play in the garden
  cat: 0.4599
  dog: 0.4599
  garden: 0.4599
  play: 0.6047

Doc 2: The dog chases the cat around the house
  cat: 0.4599
  chases: 0.6047
  dog: 0.4599
  house: 0.4599

Doc 3: The house has a beautiful garden with flowers
  beautiful: 0.5628
  flowers: 0.5628
  garden: 0.4280
  house: 0.4280
```
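
These weights do not match the plain $\text{TF} \times \log(N / \text{DF})$ formula, because `TfidfVectorizer` by default uses a smoothed IDF with a natural logarithm, $\ln\frac{1 + N}{1 + \text{DF}} + 1$, and then scales each document vector to unit L2 length. The sketch below reproduces the Doc 1 row by hand under those defaults. The term counts and document frequencies are typed in manually from the example.

```python
import numpy as np

# Doc 1 after stop-word removal contains: cat, dog, garden, play (once each).
n_docs = 3
counts = np.array([1, 1, 1, 1])
doc_freq = np.array([2, 2, 2, 1])  # cat, dog, garden occur in 2 documents; play in 1

# Smoothed IDF as used by scikit-learn's defaults (smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply by the counts, then scale to unit L2 norm (norm='l2').
weights = counts * idf
weights = weights / np.linalg.norm(weights)

print(weights.round(4))
```

The three words shared with another document get the same, lower weight, and "play", which occurs only in Doc 1, gets the highest weight.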


----------------

# Bag of Words (BoW)

### Concept
**Bag of Words (BoW)** is a model that creates features by assigning frequency values to words without considering their order or context. The term "Bag of Words" comes from the concept of putting all the words from a document into a "bag" and mixing them up, ignoring their sequence or grammatical structure.

### Feature Extraction Example
Let's consider the following three sentences as an example to illustrate BoW-based feature extraction:

- Sentence 1: `'Tom plays soccer'`
- Sentence 2: `'Tom loves soccer and baseball'`
- Sentence 3: `'baseball is his hobby and his job'`

1. **List Words and Assign Index**  
   First, list all the unique words from these sentences and assign each word a unique index:

    'Tom': 0, 'plays': 1, 'soccer': 2, 'loves': 3, 'and': 4, 'baseball': 5, 'is': 6, 'his': 7, 'hobby': 8, 'job': 9

2. **Frequency Vector for Each Sentence**  
   Next, create a vector for each sentence based on the frequency of each word:

   - Sentence 1: `[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]`
   - Sentence 2: `[1, 0, 1, 1, 1, 1, 0, 0, 0, 0]`
   - Sentence 3: `[0, 0, 0, 0, 1, 1, 1, 2, 1, 1]`

In these vectors, each value represents the frequency of the corresponding word in that sentence.
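
The two steps above can be sketched in plain Python. The code assumes whitespace tokenization and assigns indices in order of first appearance, which yields the same index table as step 1. The helper name `to_vector` is made up for this example.

```python
sentences = [
    "Tom plays soccer",
    "Tom loves soccer and baseball",
    "baseball is his hobby and his job",
]

# Step 1: assign each unique word an index in order of first appearance.
vocabulary = {}
for sentence in sentences:
    for word in sentence.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: count how often each vocabulary word occurs in a sentence.
def to_vector(sentence):
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        vector[vocabulary[word]] += 1
    return vector

vectors = [to_vector(sentence) for sentence in sentences]

print(vocabulary)
for i, vector in enumerate(vectors, start=1):
    print(f"Sentence {i}: {vector}")
```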

### Advantages and Disadvantages
**Advantages**: Simple to implement, and often effective because word frequencies alone capture much of what a document is about.

**Disadvantages**:
- **Lack of Semantic Context**: Because BoW ignores word order, it cannot capture the context of a word or the relationships between words.
- **Sparsity Problem**: The vocabulary is usually far larger than the number of distinct words in any single document, so most values in each feature vector are zeros. The result is a **sparse matrix**, which is costly to store and process in dense form and can hurt the performance of some machine learning models.
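
Both disadvantages are easy to observe with scikit-learn's `CountVectorizer`. In the sketch below, the sentences are invented for illustration: the first pair has opposite meanings but identical counts, and the second corpus covers unrelated topics so that most matrix entries are zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Lack of context: opposite meanings, identical word counts.
pair = ["the dog chased the cat", "the cat chased the dog"]
pair_matrix = CountVectorizer().fit_transform(pair).toarray()
same_vector = bool((pair_matrix[0] == pair_matrix[1]).all())
print("Identical BoW vectors:", same_vector)

# Sparsity: documents on unrelated topics share almost no vocabulary.
corpus = [
    "Tom plays soccer",
    "Stock markets fell sharply today",
    "The recipe needs fresh basil",
    "Rain is expected tomorrow",
]
matrix = CountVectorizer().fit_transform(corpus)  # stored as a SciPy sparse matrix
n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)  # fraction of non-zero entries
print(f"Shape: {matrix.shape}, non-zero fraction: {density:.2f}")
```

On a realistic corpus with tens of thousands of vocabulary terms the non-zero fraction is typically far smaller, which is why scikit-learn returns a sparse matrix rather than a dense array.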

### Feature Vectorization
Feature vectorization is the process of converting textual data into numerical vectors that can be used by machine learning models. In BoW, this is done by listing all the words from the documents and assigning frequency values to them. If we have M documents and N unique words across those documents, BoW-based feature vectorization produces an M x N matrix.

BoW feature vectorization can be performed using two main approaches:
- **Count-based vectorization (CountVectorizer)**
- **TF-IDF (Term Frequency-Inverse Document Frequency) vectorization**

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Same three example documents as in the TF-IDF section.
X = [
    "The cat and the dog play in the garden",
    "The dog chases the cat around the house",
    "The house has a beautiful garden with flowers",
]

# Drop English stop words, learn the vocabulary, and count term occurrences.
count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(X)
feature_names = count_vectorizer.get_feature_names_out()

# One row per document, one column per vocabulary term.
df = pd.DataFrame(count_matrix.toarray(), columns=feature_names)

print("\n Word List:")
print(feature_names)

print("\nCount Matrix:")
print(df)

# List the non-zero counts of each document.
for i, doc in enumerate(X):
    print(f"\nDoc {i+1}: {doc}")
    for term in feature_names:
        count_val = count_matrix[i, count_vectorizer.vocabulary_[term]]
        if count_val > 0:
            print(f"  {term}: {count_val}")
```


```text
 Word List:
['beautiful' 'cat' 'chases' 'dog' 'flowers' 'garden' 'house' 'play']

Count Matrix:
   beautiful  cat  chases  dog  flowers  garden  house  play
0          0    1       0    1        0       1      0     1
1          0    1       1    1        0       0      1     0
2          1    0       0    0        1       1      1     0

Doc 1: The cat and the dog play in the garden
  cat: 1
  dog: 1
  garden: 1
  play: 1

Doc 2: The dog chases the cat around the house
  cat: 1
  chases: 1
  dog: 1
  house: 1

Doc 3: The house has a beautiful garden with flowers
  beautiful: 1
  flowers: 1
  garden: 1
  house: 1
```


---------------

# FAISS: Facebook AI Similarity Search

FAISS is an open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is particularly designed to handle large-scale datasets containing high-dimensional vectors.

- **High Performance**: Leverages efficient algorithms and hardware acceleration (CPU and GPU support) to achieve fast search times even on billions of vectors.
- **Scalability**: Designed to handle extremely large datasets efficiently.
- **Versatility**: Offers various indexing methods suitable for different data characteristics and performance requirements.

## Mathematical Foundations of FAISS

The core functionality of FAISS revolves around efficient computation of vector similarities or distances and organizing data structures to facilitate rapid search.

### 1. Similarity Measures and Distance Metrics

#### a. Euclidean Distance (L2 Norm)

The Euclidean distance between two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is given by:

$$
d_{\text{L2}}(\mathbf{x}, \mathbf{y}) = \left\| \mathbf{x} - \mathbf{y} \right\|_2 = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
$$

This metric measures the straight-line distance between two points in Euclidean space.

#### b. Inner Product (Dot Product)

The inner product similarity between two vectors is:

$$
s_{\text{dot}}(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^n x_i y_i
$$

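When both vectors are normalized to unit length, the inner product equals their cosine similarity, and ranking by largest inner product gives the same order as ranking by smallest Euclidean distance.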
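
A small NumPy sketch of both measures, using two made-up vectors. It also checks the identity $\|\hat{\mathbf{x}} - \hat{\mathbf{y}}\|_2^2 = 2 - 2\,\hat{\mathbf{x}}^\top \hat{\mathbf{y}}$ for unit vectors, which is what ties the inner product to Euclidean distance after normalization.

```python
import numpy as np

# Two illustrative vectors (values chosen only for this example).
x = np.array([3.0, 4.0, 0.0])
y = np.array([0.0, 4.0, 3.0])

l2 = np.linalg.norm(x - y)  # Euclidean distance
dot = x @ y                 # inner product

# Normalize to unit length; the inner product then equals cosine similarity.
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)
cosine = x_hat @ y_hat

# Squared distance between the unit vectors.
l2_unit_sq = np.sum((x_hat - y_hat) ** 2)

print(f"L2 distance: {l2:.4f}")
print(f"Inner product: {dot:.1f}")
print(f"Cosine similarity: {cosine:.2f}")
print(f"Squared L2 distance of unit vectors: {l2_unit_sq:.2f}")
```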

### 2. Nearest Neighbor Search

The primary problem FAISS addresses is the **Nearest Neighbor Search (NNS)**:

Given a dataset $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$ and a query vector $\mathbf{q}$, find the vector(s) in $\mathcal{X}$ closest to $\mathbf{q}$ according to a specified distance metric.

#### Challenges:
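- **Curse of Dimensionality**: As the dimensionality $n$ increases, exact index structures such as k-d trees lose their advantage, and exact search degrades toward comparing the query with every vector.
- **Large Datasets**: Linear search costs $O(Nn)$ per query, which is impractical for massive $N$.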

**Solution**: In addition to exact brute-force indexes, FAISS provides Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of accuracy for large gains in speed and memory.
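
For reference, exact nearest neighbor search is just a linear scan. The NumPy sketch below is such a baseline, conceptually what a flat index such as FAISS's `IndexFlatL2` computes. The function name `exact_search` and the random data are assumptions for this example, not FAISS code.

```python
import numpy as np

def exact_search(database, queries, k):
    """Return indices and squared L2 distances of the k nearest database vectors."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once.
    sq_dists = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ database.T
        + (database ** 2).sum(axis=1)[None, :]
    )
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    return nearest, np.take_along_axis(sq_dists, nearest, axis=1)

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 32))

# Queries are slightly perturbed copies of the first five database vectors.
queries = database[:5] + 0.01 * rng.normal(size=(5, 32))

indices, distances = exact_search(database, queries, k=3)
print(indices[:, 0])
```

Every query touches all $N$ vectors, so the cost per query grows linearly with the dataset size. The index structures in the next section aim to avoid this.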

### 3. Indexing Structures and Algorithms

FAISS provides several indexing structures to accelerate search. Below are the key mathematical concepts behind these structures.

#### a. Product Quantization (PQ)

**Objective**: Compress high-dimensional vectors to reduce memory usage and accelerate distance computations.

**Method**:

1. **Vector Subspace Decomposition**:
   Split the original $n$-dimensional space into $m$ lower-dimensional subspaces. Each vector $\mathbf{x}$ is partitioned:

   $$
   \mathbf{x} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(m)}]
   $$

   where $\mathbf{x}^{(i)} \in \mathbb{R}^{n/m}$.

2. **Subspace Quantization**:
   For each subspace, build a codebook by clustering (e.g., using $k$-means):
   - For subspace $i$, find $k$ cluster centers $\{\mathbf{c}_1^{(i)}, \dots, \mathbf{c}_k^{(i)}\}$.

3. **Encoding Vectors**:
   Each sub-vector $\mathbf{x}^{(i)}$ is quantized to the nearest centroid:

   $$
   q^{(i)}(\mathbf{x}^{(i)}) = \arg\min_{j} \left\| \mathbf{x}^{(i)} - \mathbf{c}_j^{(i)} \right\|_2
   $$

   The original vector $\mathbf{x}$ is then represented by the set of indices $[q^{(1)}(\mathbf{x}^{(1)}), \dots, q^{(m)}(\mathbf{x}^{(m)})]$.

**Distance Approximation**:

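During search, the squared Euclidean distance between a query $\mathbf{q}$ and a database vector $\mathbf{x}$ is approximated by replacing each sub-vector of $\mathbf{x}$ with its assigned centroid:

$$
d_{\text{L2}}(\mathbf{q}, \mathbf{x})^2 \approx d_{\text{PQ}}(\mathbf{q}, \mathbf{x})^2 = \sum_{i=1}^m \left\| \mathbf{q}^{(i)} - \mathbf{c}_{q^{(i)}(\mathbf{x}^{(i)})}^{(i)} \right\|_2^2
$$

For each query, the $m \times k$ squared distances between the query sub-vectors and the codebook centroids are computed once and stored in a table. The distance to each database vector then takes only $m$ table lookups and additions.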
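
The PQ steps above can be sketched in NumPy as follows. This is an illustrative toy, not FAISS's implementation: the helper names (`kmeans`, `pq_train`, `pq_encode`, `pq_distances`), the small Lloyd's k-means, and the random data are all assumptions, and it presumes that $m$ divides $n$.

```python
import numpy as np

def kmeans(data, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        sq = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(data, m, k):
    """Train one codebook of k centroids per subspace."""
    sub_dim = data.shape[1] // m
    return [kmeans(data[:, i * sub_dim:(i + 1) * sub_dim], k, seed=i) for i in range(m)]

def pq_encode(data, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    m = len(codebooks)
    sub_dim = data.shape[1] // m
    codes = np.empty((len(data), m), dtype=np.int64)
    for i, codebook in enumerate(codebooks):
        sub = data[:, i * sub_dim:(i + 1) * sub_dim]
        sq = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = sq.argmin(axis=1)
    return codes

def pq_distances(query, codes, codebooks):
    """Approximate squared distances from the query to all encoded vectors."""
    m = len(codebooks)
    sub_dim = len(query) // m
    # table[i, j]: squared distance between query sub-vector i and centroid j.
    table = np.stack([
        ((codebook - query[i * sub_dim:(i + 1) * sub_dim]) ** 2).sum(axis=1)
        for i, codebook in enumerate(codebooks)
    ])
    # One lookup per subspace, summed over the m subspaces.
    return table[np.arange(m), codes].sum(axis=1)

rng = np.random.default_rng(0)
database = rng.normal(size=(500, 16))
query = rng.normal(size=16)

codebooks = pq_train(database, m=4, k=16)
codes = pq_encode(database, codebooks)
approx = pq_distances(query, codes, codebooks)
exact = ((database - query) ** 2).sum(axis=1)

print("codes shape:", codes.shape)
print("mean absolute error:", np.abs(approx - exact).mean())
```

Each 16-dimensional vector is stored as 4 small integers, and the approximation error depends on how well the codebooks fit the data.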

#### b. Inverted File Index (IVF)

**Objective**: Reduce the number of candidate vectors to consider during search by partitioning the dataset.

**Method**:

1. **Coarse Quantization**:
   Cluster the entire dataset into $K$ coarse clusters using $k$-means clustering in the original space. Each cluster corresponds to an inverted list.

2. **Assignment**:
   Assign each database vector $\mathbf{x}$ to its nearest coarse centroid $\mathbf{C}(\mathbf{x})$.

3. **Search**:
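   - **Query Assignment**: Assign the query $\mathbf{q}$ to its $L$ nearest coarse centroids ($L$ is the `nprobe` parameter in FAISS).
   - **Candidate Retrieval**: Compute distances only to the vectors stored in the corresponding $L$ inverted lists and return the closest of them. A larger $L$ raises recall at the cost of scanning more vectors.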
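
A minimal NumPy sketch of the IVF idea, under the same caveats as before: the helper names (`kmeans`, `ivf_build`, `ivf_search`), the toy k-means, and the random data are assumptions for illustration and do not mirror FAISS internals.

```python
import numpy as np

def kmeans(data, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        sq = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def ivf_build(data, n_lists):
    """Coarse quantization: one inverted list of vector ids per centroid."""
    centroids = kmeans(data, n_lists)
    sq = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = sq.argmin(axis=1)
    inverted_lists = [np.flatnonzero(assignment == c) for c in range(n_lists)]
    return centroids, inverted_lists

def ivf_search(query, data, centroids, inverted_lists, n_probe, k):
    """Scan only the n_probe lists whose centroids are closest to the query."""
    centroid_sq = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(centroid_sq)[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    sq = ((data[candidates] - query) ** 2).sum(axis=1)
    order = np.argsort(sq)[:k]
    return candidates[order], sq[order]

rng = np.random.default_rng(0)
database = rng.normal(size=(2000, 16))
query = rng.normal(size=16)

centroids, inverted_lists = ivf_build(database, n_lists=16)

# Probing a few lists is fast but approximate; probing all lists is exhaustive.
ids_probe, _ = ivf_search(query, database, centroids, inverted_lists, n_probe=2, k=5)
ids_full, _ = ivf_search(query, database, centroids, inverted_lists, n_probe=16, k=5)

print("n_probe=2 :", ids_probe)
print("n_probe=16:", ids_full)
```

With `n_probe` equal to the number of lists, the search scans every vector and returns the exact neighbors. Smaller values scan fewer vectors and may miss neighbors that fall in unprobed lists.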

#### c. Hierarchical Navigable Small World Graphs (HNSW)

**Objective**: Provide efficient approximate nearest neighbor search using a graph-based approach.

**Method**:

1. **Graph Construction**:
   - Build a hierarchical graph with multiple layers.
   - **Upper Layers**: Contain progressively smaller subsets of the data points, so their links span long distances.
   - **Bottom Layer**: Contains all data points, each linked to nearby neighbors.

2. **Navigability**:
   - Edges are created between nodes based on proximity, using heuristics to maintain a "small-world" property. In practice, this lets the search cost grow roughly logarithmically with the number of data points.

3. **Search Algorithm**:
   - **Greedy Search**: Starting from an entry point at the top layer, repeatedly move to the neighbor closest to the query $\mathbf{q}$ until no neighbor is closer.
   - **Layer Traversal**: Use the point reached as the entry point for the next layer down and repeat. At the bottom layer, the search keeps a list of the best candidates found so far to refine the final nearest neighbors.
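
The search procedure can be illustrated with a heavily simplified two-layer sketch. It departs from real HNSW in several labeled ways: the graphs are exact k-nearest-neighbor graphs built by brute force rather than incrementally, the upper layer is a fixed random subset, and the search returns a single point instead of maintaining a candidate list. All names (`knn_graph`, `greedy_search`) are made up for this example.

```python
import numpy as np

def knn_graph(points, n_links):
    """Link every point to its n_links nearest neighbors (brute-force construction)."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbor
    return np.argsort(sq, axis=1)[:, :n_links]

def greedy_search(points, graph, query, entry):
    """Move to the neighbor closest to the query until no neighbor is closer."""
    current = entry
    current_dist = ((points[current] - query) ** 2).sum()
    while True:
        neighbors = graph[current]
        dists = ((points[neighbors] - query) ** 2).sum(axis=1)
        best = dists.argmin()
        if dists[best] >= current_dist:
            return current, current_dist
        current, current_dist = neighbors[best], dists[best]

rng = np.random.default_rng(0)
points = rng.normal(size=(300, 8))
query = rng.normal(size=8)

# Upper layer: a small random subset. Bottom layer: all points.
upper_ids = rng.choice(len(points), size=30, replace=False)
upper_graph = knn_graph(points[upper_ids], n_links=4)
base_graph = knn_graph(points, n_links=8)

# Search the sparse upper layer first, then descend with the result as entry point.
local_entry, _ = greedy_search(points[upper_ids], upper_graph, query, entry=0)
entry = upper_ids[local_entry]
found, found_dist = greedy_search(points, base_graph, query, entry)

true_nearest = ((points - query) ** 2).sum(axis=1).argmin()
print("found:", found, "true nearest:", true_nearest)
```

Greedy descent stops at a point that is closer to the query than all of its graph neighbors, which is not guaranteed to be the true nearest neighbor. The candidate list kept by the full algorithm at the bottom layer reduces that risk.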

#### d. Optimized Product Quantization (OPQ)

OPQ is an extension of PQ that applies a rotation (an orthogonal linear transformation) to the data before quantization to minimize quantization error.

**Mathematical Formulation**:

1. **Find Optimal Rotation Matrix** $\mathbf{R}$:
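   Solve, jointly over the rotation and the PQ codebooks:

   $$
   \min_{\mathbf{R},\, q} \sum_{i=1}^N \left\| \mathbf{x}_i - \mathbf{R}^\top q(\mathbf{R} \mathbf{x}_i) \right\|_2^2 \quad \text{subject to } \mathbf{R}^\top \mathbf{R} = \mathbf{I}
   $$

   where $q(\cdot)$ is the PQ quantization function, here mapping a vector to its reconstruction from the nearest centroid in each subspace.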

2. **Rotation and Quantization**:
   - Rotate vectors: $\mathbf{z}_i = \mathbf{R} \mathbf{x}_i$.
   - Apply PQ on rotated vectors $\mathbf{z}_i$.
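
One common way to approach this objective is to alternate between two steps: fit PQ on the rotated data with $\mathbf{R}$ fixed, then update $\mathbf{R}$ with the codes fixed, which is an orthogonal Procrustes problem solved by an SVD. The NumPy sketch below follows that scheme as an assumption about how the optimization can be carried out. The helper names, the toy k-means, and the synthetic correlated data are illustrative, not FAISS code.

```python
import numpy as np

def kmeans(data, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        sq = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_reconstruct(data, m, k):
    """Train PQ on `data` and return the quantized reconstruction q(data)."""
    sub_dim = data.shape[1] // m
    recon = np.empty_like(data)
    for i in range(m):
        sub = data[:, i * sub_dim:(i + 1) * sub_dim]
        codebook = kmeans(sub, k, seed=i)
        sq = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        recon[:, i * sub_dim:(i + 1) * sub_dim] = codebook[sq.argmin(axis=1)]
    return recon

rng = np.random.default_rng(0)
# Correlated synthetic data: plain PQ handles correlation across subspaces poorly.
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))

R = np.eye(8)  # start from the identity, i.e. plain PQ
for step in range(5):
    Z = X @ R.T                        # rows are z_i = R x_i
    Y = pq_reconstruct(Z, m=4, k=16)   # q(R x_i) with R fixed
    error_before = ((Z - Y) ** 2).sum()

    # With Y fixed, the best orthogonal R minimizes ||X R^T - Y||_F^2.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    R = U @ Vt
    error_after = ((X @ R.T - Y) ** 2).sum()
    print(f"step {step}: error {error_before:.1f} -> {error_after:.1f}")
```

Because $\mathbf{R}$ is orthogonal, $\|\mathbf{x}_i - \mathbf{R}^\top q(\mathbf{R}\mathbf{x}_i)\|_2 = \|\mathbf{R}\mathbf{x}_i - q(\mathbf{R}\mathbf{x}_i)\|_2$, so the error can be measured in the rotated space as the code does.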