## FAISS

FAISS is a popular library for efficient similarity search over vector embeddings. It is often used in machine learning, NLP, and RAG applications. FAISS implements fast nearest neighbor search, so the vectors most similar to a query can be retrieved quickly even from large collections.

Before learning about Fast Nearest Neighbor Search, there are a few concepts that we need to be aware of.

### Nearest Neighbor Search (NNS)
Nearest Neighbor Search can be used to retrieve documents relevant to a query. It looks for the vectors closest to the query vector and returns the documents they represent. Closeness is typically measured with one of the following metrics:

1. Cosine similarity (angle between vectors)
2. Euclidean distance (L2 norm)

There are two types of Nearest Neighbor Search:

- Exact Nearest Neighbor Search (ENN)

    This is a brute force search. It compares the query vector with every vector in the database. While this is 100% accurate, it is too slow for millions of embeddings. 

- Approximate Nearest Neighbor Search (ANN)

    ANN uses clever data structures like trees, graphs or clustering to speed things up. It trades a tiny bit of accuracy for huge performance gains. ANN is widely used in RAG applications. 
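
The brute-force approach described under ENN can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how FAISS implements it; the function names and the tiny `database` list are made up for this example.

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest_neighbors(query, vectors, k=1):
    """Return the indices of the k vectors closest to the query."""
    # Every stored vector is compared with the query, so the cost
    # grows linearly with the number of vectors.
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return order[:k]

# Hypothetical 2D "embeddings", used only for illustration.
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = exact_nearest_neighbors([1.0, 1.0], database, k=2)
print(nearest)
```

Because the whole database is scanned for each query, the result is exact, but the work per query becomes prohibitive once the database holds millions of vectors.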

### Embedding Space
When we generate embeddings (say, using OpenAI, Sentence Transformers, or CLIP), each piece of text, image, or item is mapped to a vector in a high-dimensional space (for example 384, 768, or 1536 dimensions). Think of it as a cloud of points floating in space. The distance between points reflects semantic similarity (close = similar, far = different).

### Distance Metrics 
Distance metrics are the different ways of measuring the distance between vectors in Nearest Neighbor Search. The distance between two vectors determines how similar they are: the closer the vectors, the more similar they are. Commonly used distance metrics include:

1. Euclidean Distance (L2 Norm)

    - Formula

        ![alt text](euclidean-distance-formulae.jpg "Euclidean Distance")
    
    - Measures the straight line geometric distance between two vectors. 
    - Intuition: "How far apart are the points in space?"
    - Use case: Works well when both the magnitude and the direction of the vectors matter (e.g. clustering images or sensor data).

2. Manhattan Distance (L1 Norm)

    - Formula

        ![alt text](manhattan-distance-formulae.jpg "Manhattan Distance")
    
    - Measures distance as if we can only move along grid lines, like city blocks in Manhattan.
    - Intuition: Less sensitive than Euclidean distance to a large difference in a single dimension, because the differences are not squared.
    - Use case: Sometimes used with sparse embeddings or tabular data.

3. Cosine Similarity / Cosine Distance

    - Formula

        ![alt text](cosine-similarity-formulae.jpg "Cosine Similarity")

        Cosine Distance = 1 - sim(x,y)

    - Focuses on the angle between the vectors.
    - Intuition: Two vectors pointing in the same direction are similar even if they differ in length.
    - Use case: Very common for semantic embeddings (text embeddings, document search, RAG memory), since meaning is captured by direction rather than length.
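
The three metrics above can be written out directly with the Python standard library. This is a minimal sketch for intuition, assuming plain lists of floats as vectors; the function names are chosen for this example and are not part of FAISS.

```python
import math

def euclidean_distance(x, y):
    # L2 norm of the difference: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    # L1 norm of the difference: sum of absolute per-dimension differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    # Distance form: 0 for the same direction, 1 for perpendicular vectors.
    return 1.0 - cosine_similarity(x, y)

x, y = [3.0, 4.0], [6.0, 8.0]
print(euclidean_distance(x, y))   # straight-line distance
print(manhattan_distance(x, y))   # grid distance
print(cosine_similarity(x, y))    # same direction, so similarity is 1
```

Note that `x` and `y` point in the same direction but have different lengths: Euclidean and Manhattan distance report them as far apart, while cosine similarity treats them as identical in direction.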


### Vector Normalization
Vector normalization means scaling a vector so that its length (magnitude) becomes 1. After normalization, all vectors lie on the same unit sphere in the embedding space. This ensures that no vector dominates purely because it has a larger magnitude. Some embedding models produce vectors of different lengths even for similar meanings. Normalization ensures that only direction (semantic meaning) matters. Let's look at this in detail.

A vector is just a list of numbers representing some object (like a piece of text or an image). Example:

 - Vector for "doberman" -> [3,4]
 - Vector for "pug" -> [6,8]

Both vectors represent dogs, but the vector for pug is twice as long as the vector for doberman, which may give the impression that they are not closely related. Vector normalization removes this difference in magnitude.

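Vector normalization scales each vector so that its length (magnitude) becomes 1 without changing its direction.

The formula for normalizing a vector V = [v1, v2, ..., vd] is Vn = V / sqrt(v1^2 + v2^2 + ... + vd^2), that is, each component is divided by the square root of the sum of the squares of all components of V.

For the above example, the normalized vector for doberman would be [3/sqrt(3^2 + 4^2), 4/sqrt(3^2 + 4^2)] = [3/5, 4/5], which is [0.6, 0.8].

Applying the same formula to pug gives [6/sqrt(6^2 + 8^2), 8/sqrt(6^2 + 8^2)] = [6/10, 8/10], which is also [0.6, 0.8]. After normalization the two vectors are identical, because they point in the same direction.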
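
The worked example above can be checked with a short Python sketch. The `normalize` helper is written for this illustration only and assumes the vector is not all zeros.

```python
import math

def normalize(vector):
    # Length (L2 norm) of the vector: sqrt of the sum of squared components.
    length = math.sqrt(sum(v * v for v in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    # Dividing every component by the length keeps the direction
    # and makes the resulting length equal to 1.
    return [v / length for v in vector]

doberman = normalize([3, 4])
pug = normalize([6, 8])
print(doberman, pug)
```

Both inputs end up as the same unit vector, which is why a dot product or L2 distance on normalized vectors reflects direction only.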

### Indexes in FAISS
FAISS provides different types of indexes for storing vectors, each with its own trade-offs (accuracy, speed, memory).

1. Flat Index
    - 

### Fast Nearest Neighbor Search
Nearest Neighbor Search looks for the vectors closest to the query vector. In large RAG applications, where we may have millions of document embeddings, a naive search would compute the distance from the query to every single vector, which would be very slow.

Fast Nearest Neighbor Search is a general term for any optimization that makes nearest neighbor retrieval quicker than brute force, using optimized data structures and algorithms. The following optimization techniques are generally employed:

1. Hardware-level optimizations
2. Indexing structures for exact NNS
3. Approximate Nearest Neighbor (ANN) structures
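
To make the third technique concrete, here is a minimal sketch of a clustering-based ANN search in plain Python: vectors are grouped by their nearest centroid, and a query scans only the closest groups instead of the whole database. The hand-picked `centroids`, the helper names, and the `nprobe` parameter name are assumptions made for this illustration; real libraries learn the centroids from the data and use far more optimized code.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_cells(vectors, centroids):
    """Group vector indices by the centroid they are closest to."""
    cells = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest_c = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        cells[nearest_c].append(i)
    return cells

def approximate_search(query, vectors, centroids, cells, k=1, nprobe=1):
    """Search only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    # Exact ranking, but only over the candidates from the probed cells.
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

# Hypothetical data: five 2D vectors and three hand-picked centroids.
vectors = [[0.1, 0.2], [0.3, 0.1], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
cells = build_cells(vectors, centroids)
result = approximate_search([5.0, 5.0], vectors, centroids, cells, k=2, nprobe=1)
print(cells, result)
```

With `nprobe=1` only two of the five vectors are compared with the query. The trade-off is that a true neighbor sitting in an unprobed cell would be missed, which is the "approximate" part; raising `nprobe` recovers accuracy at the cost of more comparisons.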

### FAISS

FAISS stands for Facebook AI Similarity Search. It is a library for fast nearest neighbor search in high-dimensional vector spaces. FAISS provides specialized data structures and algorithms to make Nearest Neighbor Search (NNS) faster. 

#### Core Features of FAISS
