<center>
    <h1>Vector Databases</h1>
</center>

# Brief Recap of Vector Databases

* Vector databases are specialized database systems designed to store, index, and query high-dimensional vector data efficiently.

* They make similarity search practical for machine learning applications by quickly retrieving the items whose vector representations lie closest to a query vector.

* Vector databases have several advantages over traditional database systems:
  1. Similarity search: They can quickly find the most similar vectors to a given query vector.
  2. High-dimensional data handling: They efficiently manage data with hundreds or thousands of dimensions.
  3. Scalability: Vector databases can handle billions of vectors, making them suitable for large-scale applications.

* Vector databases rely on Approximate Nearest Neighbor (ANN) indexing, which trades a small amount of accuracy for much faster search in high-dimensional spaces.

* They have become foundational for many AI-powered applications, including recommendation systems, image and text search, anomaly detection, and natural language processing tasks.

* Popular options include the Faiss similarity-search library and database systems such as Milvus, Pinecone, and Weaviate, each offering different features and optimizations.

* These databases continue to evolve, finding new applications across industries such as e-commerce, content recommendation, fraud detection, and scientific research.


<center>
    <img src="static/img1.gif" alt="Vector Databases Example" style="width:50%;">
</center>

## Architecture of Vector Databases

* A vector database has two main components: the indexing system and the query processor. Together they cover the following key elements:

1. **Data Ingestion**: Converts input data into high-dimensional vector representations.

2. **Vector Indexing**: Creates efficient data structures for fast similarity search in high-dimensional spaces.

3. **Approximate Nearest Neighbor (ANN) Algorithms**: The core component of Vector Databases, allowing for rapid similarity search in high-dimensional spaces.

4. **Distance Metrics**: Implement various distance measures (e.g., Euclidean, cosine similarity) for comparing vectors.

5. **Dimensionality Reduction**: Techniques to reduce vector dimensions while preserving similarity relationships.

6. **Query Processing**: Handles incoming queries and returns the most similar vectors based on the index.

**Important Points:**

* The indexing system organizes and structures the vector data, while the query processor handles search requests.

* Many vector databases use distributed architectures to handle large-scale data and provide high availability and fault tolerance.

* Advanced features often include CRUD operations, real-time updates, and support for hybrid searches combining vector and scalar data.
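
The distance metrics listed above can be illustrated with a minimal NumPy sketch. The helper names `euclidean_distance` and `cosine_similarity` are illustrative, not part of any particular vector database API. The last lines also check a useful identity: for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so both metrics rank neighbors in the same order once vectors are normalized.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a (dot product is 0)

print("L2(a, b):    ", euclidean_distance(a, b))   # non-zero: lengths differ
print("cosine(a, b):", cosine_similarity(a, b))    # direction is identical
print("cosine(a, c):", cosine_similarity(a, c))    # no directional overlap

# Normalize to unit length and compare the two metrics
u = a / np.linalg.norm(a)
v = c / np.linalg.norm(c)
print("squared L2 on unit vectors:", euclidean_distance(u, v) ** 2)
print("2 - 2 * cosine:            ", 2 - 2 * cosine_similarity(u, v))
```

Euclidean distance is sensitive to vector length, while cosine similarity only compares direction, which is why text embeddings are often normalized before indexing.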

<center>
    <img src="static/img2.gif" alt="Working of Vector Databases" style="width:50%;">
</center>

## Applications of Vector Databases

Vector databases have found wide-ranging applications across various domains:

1. **Information Retrieval:**
   - Semantic search
   - Document similarity
   - Content recommendation
   - Plagiarism detection

2. **Computer Vision:**
   - Image similarity search
   - Face recognition
   - Visual product search
   - Image deduplication

3. **Natural Language Processing:**
   - Text classification
   - Question answering systems
   - Language translation
   - Chatbots and conversational AI

4. **Recommender Systems:**
   - Personalized product recommendations
   - Content-based filtering
   - Collaborative filtering at scale

5. **Anomaly Detection:** Identifying unusual patterns in data across various industries.

6. **Bioinformatics:** Analyzing genetic sequences and protein structures.

7. **Financial Services:**
   - Fraud detection
   - Risk assessment
   - Market trend analysis

8. **Audio Processing:**
   - Music recommendation
   - Speaker identification
   - Audio fingerprinting

9. **Cybersecurity:**
   - Malware detection
   - Network intrusion detection
   - Phishing URL detection

10. **E-commerce:**
    - Visual search
    - Product matching
    - Inventory management

11. **Geospatial Analysis:** Processing and querying location-based data efficiently.

12. **Healthcare:**
    - Medical image analysis
    - Drug discovery
    - Patient similarity for personalized medicine

# Implementing some core concepts of building a Vector Database

## Data Ingestion and Preprocessing

- Data ingestion and preprocessing are crucial steps in building a vector database. 
- This process involves collecting raw data from various sources and preparing it for vectorization. 
- It's important because clean, well-structured data leads to more accurate vector representations and better search results.

**Inputs** 
- Raw data from diverse sources (e.g., text documents, images, audio files)

**Process** 
- Clean, normalize, and format the data

**Outputs** 
- Preprocessed data ready for vectorization

Here's a simple example of text preprocessing using Python:

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')


In [4]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

raw_text = "This is an example sentence with some numbers (123) and punctuation!"
preprocessed_text = preprocess_text(raw_text)
print(f"Preprocessed text: {preprocessed_text}")

Preprocessed text: example sentence numbers punctuation


## Vector Embedding Generation

- Vector embedding generation is the next crucial step after preprocessing when working with vector databases. 
- It involves converting preprocessed data into numerical vectors that represent the semantic meaning or features of the data. 
- This process is essential because it allows us to quantify and compare the similarity between different pieces of information in a high-dimensional space.

**Inputs** 
- Preprocessed data (e.g., cleaned text, normalized images)

**Process** 
- Use pre-trained models or custom algorithms to generate vector representations

**Outputs** 
- Numerical vectors (embeddings)

Here's a short code snippet demonstrating vector embedding generation for text using the sentence-transformers library:

In [5]:
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample preprocessed text
text = "vector databases powerful tools similarity search"

# Generate embedding
embedding = model.encode(text)

print(f"Embedding shape: {embedding.shape}")
print(f"First few values: {embedding[:5]}")

Embedding shape: (384,)
First few values: [-0.02477177 -0.03728328 -0.06883422 -0.07348053 -0.01428398]


## Vector Indexing

- Vector indexing is the process of organizing vector embeddings in a way that allows for efficient similarity search. 
- It's crucial for scaling vector databases to handle large amounts of data while maintaining fast query times.
- Indexing structures like hierarchical navigable small world (HNSW) or inverted file (IVF) are commonly used to achieve this.

**Inputs** 
- Vector embeddings

**Process** 
- Build an index structure to organize vectors for efficient retrieval

**Outputs** 
- Indexed vector database

Here's a simple example using the FAISS library to create a flat index, which stores the vectors as-is and performs exact (brute-force) L2 search:

In [6]:
import numpy as np
import faiss

# Sample vector data
dimension = 128
num_vectors = 10000
vectors = np.random.random((num_vectors, dimension)).astype('float32')

# Create an index
index = faiss.IndexFlatL2(dimension)

# Add vectors to the index
index.add(vectors)

print(f"Total vectors indexed: {index.ntotal}")

Total vectors indexed: 10000
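
To make the inverted file (IVF) idea mentioned above more concrete, here is a simplified NumPy sketch, not FAISS's actual implementation. It clusters the vectors with a few k-means steps, stores the row ids of each cluster in an "inverted list", and at query time scans only the `nprobe` lists whose centroids are closest to the query. The function names (`build_ivf`, `ivf_search`) and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def sq_dists(x, ys):
    """Squared L2 distances from one vector x to each row of ys."""
    diff = ys - x
    return np.einsum("ij,ij->i", diff, diff)

def assign_to_centroids(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    # ||v - c||^2 = ||v||^2 - 2 v.c + ||c||^2; ||v||^2 does not affect the argmin
    d = -2.0 * vectors @ centroids.T + (centroids ** 2).sum(axis=1)
    return d.argmin(axis=1)

def build_ivf(vectors, nlist, n_iter=10, seed=0):
    """Run a few k-means steps and group row ids by nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = assign_to_centroids(vectors, centroids)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    assign = assign_to_centroids(vectors, centroids)
    inverted_lists = [np.where(assign == c)[0] for c in range(nlist)]
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    closest = np.argsort(sq_dists(query, centroids))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in closest])
    d = sq_dists(query, vectors[candidates])
    order = np.argsort(d)[:k]
    return candidates[order], d[order]

rng = np.random.default_rng(42)
data = rng.random((2000, 32)).astype("float32")
centroids, lists = build_ivf(data, nlist=16)
query = rng.random(32).astype("float32")

# nprobe=4 scans a quarter of the lists; nprobe=16 scans all of them (exhaustive)
approx_ids, _ = ivf_search(query, data, centroids, lists, k=5, nprobe=4)
exact_ids, _ = ivf_search(query, data, centroids, lists, k=5, nprobe=16)
print("nprobe=4 :", approx_ids)
print("nprobe=16:", exact_ids)
```

Raising `nprobe` scans more lists, which improves recall at the cost of speed. With `nprobe` equal to `nlist`, the search visits every vector and matches the brute-force result.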


## Metadata Management

- Metadata management is an essential aspect of building a vector database that is often overlooked. 
- It involves storing and organizing additional information about the vectors, such as their source, creation date, or associated labels.
- Proper metadata management enhances the usability and interpretability of the vector database.

**Inputs**
- Vector embeddings, associated metadata

**Process** 
- Store metadata alongside vectors, create efficient retrieval mechanisms

**Outputs**
- Indexed vectors with linked metadata

Here's a conceptual example of how to manage metadata using a simple dictionary:

In [7]:
import uuid

class VectorWithMetadata:
    def __init__(self, vector, metadata):
        self.id = str(uuid.uuid4())
        self.vector = vector
        self.metadata = metadata

# Create a dictionary to store vectors with metadata
vector_store = {}

# Add vectors with metadata
vector1 = np.random.random(dimension).astype('float32')
metadata1 = {"source": "document1.txt", "category": "finance"}
vector_with_metadata1 = VectorWithMetadata(vector1, metadata1)
vector_store[vector_with_metadata1.id] = vector_with_metadata1

# Retrieve vector and metadata
retrieved_vector = vector_store[vector_with_metadata1.id]
print(f"Retrieved vector: {retrieved_vector.vector[:5]}")
print(f"Retrieved metadata: {retrieved_vector.metadata}")

Retrieved vector: [0.85523206 0.81125873 0.02158552 0.42390928 0.8991076 ]
Retrieved metadata: {'source': 'document1.txt', 'category': 'finance'}
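
A flat index returns row positions rather than UUIDs, so the metadata store needs a way to map positions back to records. The sketch below is one possible arrangement under simple assumptions: a list `ids` kept in the same order as the rows of the vector matrix, a dictionary keyed by id, and a post-filter applied to the ranked results. The names `search_with_filter`, `metadata_by_id`, and the sample file names are hypothetical, and the search is plain NumPy rather than a FAISS call.

```python
import uuid
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Parallel structures: row i of `matrix` belongs to ids[i]
ids, rows, metadata_by_id = [], [], {}
for source, category in [("a.txt", "finance"), ("b.txt", "sports"),
                         ("c.txt", "finance"), ("d.txt", "health")]:
    vec_id = str(uuid.uuid4())
    ids.append(vec_id)
    rows.append(rng.random(dim).astype("float32"))
    metadata_by_id[vec_id] = {"source": source, "category": category}
matrix = np.stack(rows)

def search_with_filter(query, k, category):
    """Rank all rows by squared L2 distance, then keep rows whose metadata matches."""
    dists = ((matrix - query) ** 2).sum(axis=1)
    hits = []
    for pos in np.argsort(dists):
        meta = metadata_by_id[ids[pos]]
        if meta["category"] == category:
            hits.append((ids[pos], meta["source"], float(dists[pos])))
        if len(hits) == k:
            break
    return hits

query = rng.random(dim).astype("float32")
for vec_id, source, dist in search_with_filter(query, k=2, category="finance"):
    print(source, round(dist, 4))
```

Post-filtering is simple but can return fewer than `k` results when few records match, which is one reason production systems often filter before or during the index search instead.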


## Similarity Search Implementation

- Similarity search is the core operation in vector databases. 
- It involves finding the most similar vectors to a given query vector based on a distance metric (e.g., Euclidean distance, cosine similarity). 
- This operation is fundamental for various applications such as recommendation systems, image retrieval, and semantic text search.

**Inputs** 
- Query vector, indexed vector database

**Process** 
- Search the index for the nearest neighbors of the query vector

**Outputs** 
- List of similar vectors and their distances/similarities

Here's an example of performing a similarity search using the previously created FAISS index. Note that `IndexFlatL2` reports squared Euclidean distances:

In [8]:
# Perform a similarity search
k = 5  # Number of nearest neighbors to retrieve
query_vector = np.random.random((1, dimension)).astype('float32')

distances, indices = index.search(query_vector, k)

print(f"Indices of {k} nearest neighbors: {indices[0]}")
print(f"Distances to {k} nearest neighbors: {distances[0]}")

Indices of 5 nearest neighbors: [5086 2470 1156 3986 5690]
Distances to 5 nearest neighbors: [14.633185  14.765137  15.56501   15.658556  15.7717905]
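
For intuition about what an exact (flat) search computes, here is a minimal NumPy sketch of brute-force k-nearest-neighbor search. It is not FAISS code; `brute_force_search` is an illustrative name. Like `IndexFlatL2`, it ranks by squared Euclidean distance and returns distances and indices with one row per query.

```python
import numpy as np

def brute_force_search(database, queries, k):
    """Exact k-NN by squared L2 distance; returns (distances, indices) per query."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for all pairs at once
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)
    x_norms = (database ** 2).sum(axis=1)
    sq = q_norms - 2.0 * queries @ database.T + x_norms
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    # argpartition finds the k smallest per row without sorting the whole row
    part = np.argpartition(sq, k - 1, axis=1)[:, :k]
    part_d = np.take_along_axis(sq, part, axis=1)
    order = np.argsort(part_d, axis=1)  # sort just the k candidates
    return (np.take_along_axis(part_d, order, axis=1),
            np.take_along_axis(part, order, axis=1))

rng = np.random.default_rng(0)
database = rng.random((1000, 16)).astype("float32")
queries = database[:3] + 0.001  # slightly perturbed copies of rows 0, 1, 2

distances, indices = brute_force_search(database, queries, k=5)
print("Nearest neighbor of each query:", indices[:, 0])
print("Distances for first query:", distances[0])
```

The cost grows linearly with the number of stored vectors, which is the motivation for approximate indexes once collections reach millions of entries.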


# Let's Build a Real-World Project to Understand Vector Databases Better

# Semantic Search Engine using Vector Database


## Problem Description

We aim to build a semantic search engine using a vector database to enable efficient and meaningful search across a large corpus of text documents. This project will demonstrate the effectiveness of vector databases in capturing semantic relationships between documents and enabling fast, relevant search results based on the meaning of queries rather than just keyword matching.

## Dataset Description

- The project uses the first 100,000 articles of the English Wikipedia dataset (`20220301.en` snapshot). The intended split is 90,000 articles for indexing and 10,000 for test queries, although the code below indexes all 100,000 for simplicity.
- The code below uses each article's title and full text content.
- The dataset provides a diverse range of topics and writing styles, ideal for testing semantic search capabilities.
- Key features of the dataset:
  - Rich text data including titles and full content
  - Wide variety of topics covering general knowledge
  - Varied article lengths and complexities
- For more information about the Wikipedia Articles dataset, you can visit the following link: [Wikipedia Articles Dataset](https://huggingface.co/datasets/wikipedia)

In [2]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss

## Loading and Preprocessing the Wikipedia Dataset

We use the Hugging Face `datasets` library to load the English Wikipedia snapshot. The `split="train[:100000]"` argument restricts the load to the first 100,000 articles.

In [3]:
from datasets import load_dataset

# Load the Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train[:100000]")

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(dataset)

# Preprocess the text (combine title and text)
df['combined_text'] = df['title'] + " " + df['text']

## Creating Vector Embeddings

We use the SentenceTransformer model to generate vector embeddings for each article's combined text.

In [4]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
embeddings = model.encode(df['combined_text'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 3125/3125 [09:10<00:00,  5.68it/s]
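
One caveat with embedding whole articles: according to its model card, `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so only the opening portion of a long article influences its embedding. A common workaround is to split each article into overlapping chunks and embed every chunk separately. The sketch below shows only the chunking step; `chunk_words`, the chunk size, and the overlap are illustrative assumptions (word counts only approximate word-piece counts), and the sample article is made up.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical article with 500 words
sample_article = {"title": "Example", "text": " ".join(f"w{i}" for i in range(500))}

# Keep the article position with each chunk so search hits map back to the article
records = [
    {"article_idx": 0, "chunk": chunk}
    for chunk in chunk_words(sample_article["text"])
]
print(f"Number of chunks: {len(records)}")
print(f"Words in first chunk: {len(records[0]['chunk'].split())}")
```

Each chunk would then be encoded and indexed in place of the full article, with `article_idx` used to look up the title when displaying results.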


## Building the Vector Database

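We use FAISS to build a flat L2 index over our vector embeddings. `IndexFlatL2` performs exact, exhaustive search, which is fast enough for 100,000 vectors; much larger collections typically call for an approximate index such as IVF or HNSW.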

In [5]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add vectors to the index
index.add(embeddings.astype('float32'))

print(f"Total vectors indexed: {index.ntotal}")

Total vectors indexed: 100000


## Implementing Semantic Search

Now we can perform semantic searches using our vector database.

In [6]:
def semantic_search(query, top_k=5):
    # Encode the query
    query_vector = model.encode([query])[0].astype('float32').reshape(1, -1)
    
    # Perform the search
    distances, indices = index.search(query_vector, top_k)
    
    # Return results
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'title': df.iloc[idx]['title'],
            'text': df.iloc[idx]['text'][:200] + "...",  # Preview of text
            'distance': distances[0][i]
        })
    
    return results

In [7]:
# Example search
search_results = semantic_search("quantum computing applications")
for result in search_results:
    print(f"Title: {result['title']}")
    print(f"Preview: {result['text']}")
    print(f"Distance: {result['distance']}")
    print("---")

Title: Quantum computing
Preview: Quantum computing is a type of computation that harnesses the collective properties of quantum states, such as superposition, interference, and entanglement, to perform calculations. The devices that ...
Distance: 0.7054274082183838
---
Title: Applications of quantum mechanics
Preview: Quantum physics is a branch of modern physics in which energy and matter are described at their most fundamental level, that of energy quanta, elementary particles, and quantum fields. Quantum physics...
Distance: 0.9135433435440063
---
Title: Quantum key distribution
Preview: Quantum key distribution (QKD) is a secure communication method which implements a cryptographic protocol involving components of quantum mechanics. It enables two parties to produce a shared random s...
Distance: 0.9649192094802856
---
Title: Shor's algorithm
Preview: Shor's algorithm is a quantum computer algorithm for finding the prime factors of an integer. It was discovered in 1994 by the Amer