# Vector Databases: Storing and Retrieving Embeddings

**Project Name:** Vector Databases: Storing and Retrieving Embeddings  
**Date:** February 10th, Day 8

---

## 1. Introduction and Objective

**Introduction:**  
Modern AI applications, especially those involving large language models (LLMs), often rely on embeddings—numerical representations of text or other data. However, as the number of embeddings grows (sometimes into the millions), it becomes crucial to have efficient methods to store, index, and retrieve these high-dimensional vectors. This is where vector databases come in. They allow for fast similarity search using metrics like cosine similarity or Euclidean distance, making them essential for applications such as semantic search, retrieval-augmented generation (RAG), and recommendation systems.

**Objective:**  
- Understand what vector databases are and why they are needed in AI applications.
- Explore use cases such as semantic search and RAG.
- Introduce popular vector database tools like Pinecone and Weaviate.
- Set up a vector database (using Pinecone in our example).
- Demonstrate storing and retrieving embeddings using a sample dataset.

---

## 2. Metadata

- **Dataset:** (For demonstration, we will create a small set of sample texts to generate embeddings.)
- **Technologies:** Python, Hugging Face Transformers, Pinecone (vector database service), NumPy, and Matplotlib.
- **Tools:** Pinecone (or Weaviate can be used similarly), a pre-trained embedding model (e.g., Sentence Transformers).
- **Environment:** Jupyter Notebook / Google Colab (CPU-friendly configuration)
- **Applications:** Semantic search, retrieval-augmented generation (RAG), recommendation systems, and more.

---

## 3. Conceptual Overview

### 3.1 What are Vector Databases?

Vector databases are specialized data stores designed to handle high-dimensional vectors (embeddings). They are optimized for fast similarity search, typically by using approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for large gains in speed. Unlike traditional relational databases, which index data for exact matching, vector databases index and retrieve data based on the closeness of vector representations.
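
To make the idea of "closeness" concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search in NumPy. It is not how Pinecone or any other vector database is implemented; it only shows the operation that ANN indexes approximate. The function name `exact_top_k` and the toy 2-D vectors are made up for illustration.

```python
import numpy as np

def exact_top_k(query, vectors, k=3):
    """Return (row index, cosine similarity) pairs for the k rows closest to query."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize so that a plain dot product equals cosine similarity.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    # Sort by descending similarity and keep the first k rows.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D "embeddings" (hypothetical values, not real model output).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = exact_top_k([1.0, 0.1], toy_vectors, k=2)
print(top)  # row 0 ranks first, row 2 second
```

This scan compares the query against every stored vector, so its cost grows linearly with the collection size. That cost is what motivates approximate indexes once a collection reaches millions of vectors.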

### 3.2 Use Cases

- **Semantic Search:**  
  Retrieve documents or data that are semantically similar to a given query based on vector similarity.
- **Retrieval-Augmented Generation (RAG):**  
  Combine language models with retrieval systems to provide relevant context for generating more accurate responses.
- **Recommendation Systems:**  
  Suggest items (e.g., products, movies) based on the similarity of their embeddings.
- **Anomaly Detection:**  
  Identify unusual data points by comparing their embeddings to those of typical data.
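
As a rough illustration of the RAG use case, the sketch below shows only the step that follows retrieval: placing the retrieved passages in front of the user's question. The function name `build_rag_prompt`, the prompt wording, and the `max_chunks` limit are assumptions made for this example, not part of any library.

```python
def build_rag_prompt(question, retrieved_texts, max_chunks=3):
    """Assemble a prompt that places retrieved passages before the question."""
    # Number the passages so the model (and a reader) can refer to them.
    context_lines = [
        f"[{i + 1}] {text}" for i, text in enumerate(retrieved_texts[:max_chunks])
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Passages as they might come back from a vector database query.
passages = [
    "Sushi bar with fresh seafood and authentic Japanese dishes",
    "Italian restaurant with great pasta and wine",
]
prompt = build_rag_prompt("Where can I get fresh sushi?", passages)
print(prompt)
```

The resulting string would then be sent to a language model; that call is left out here because it depends on the model provider.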

### 3.3 Popular Tools

- **Pinecone:**  
  A managed vector database service that offers scalable and fast vector search capabilities.
- **Weaviate:**  
  An open-source vector search engine that also includes features like GraphQL-based querying.
- **Milvus:**  
  Another open-source vector database optimized for similarity search.

### 3.4 Mathematical Intuition

Embeddings are points in a high-dimensional space. The similarity between two embeddings can be measured using:
- **Cosine Similarity:**  
  \[
  \text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \|b\|}
  \]
- **Euclidean Distance:**  
  \[
  \text{euclidean\_distance}(a, b) = \sqrt{\sum_{i=1}^{d}(a_i - b_i)^2}
  \]
Vector databases use these metrics to quickly retrieve the most similar vectors, even among millions of entries, by leveraging efficient indexing structures.
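
Both formulas translate directly into NumPy. The following is a small sketch with hand-picked 3-D vectors so the arithmetic can be checked by hand; real embeddings would have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: (a . b) / (||a|| ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance: sqrt(sum_i (a_i - b_i)^2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

a = [1.0, 2.0, 2.0]   # ||a|| = 3
b = [2.0, 4.0, 4.0]   # b = 2a, so ||b|| = 6

cos = cosine_similarity(a, b)    # 18 / (3 * 6) = 1.0
dist = euclidean_distance(a, b)  # sqrt(1 + 4 + 4) = 3.0
print(cos, dist)
```

The two vectors point in the same direction, so their cosine similarity is 1 even though they are 3 units apart. Cosine similarity ignores magnitude, while Euclidean distance does not.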

### 3.5 Advantages and Disadvantages

**Advantages:**
- **Scalability:**  
  Can efficiently handle millions of high-dimensional vectors.
- **Speed:**  
  Optimized for rapid similarity searches, crucial for real-time applications.
- **Flexibility:**  
  Supports a wide range of AI applications—from semantic search to recommendations.

**Disadvantages:**
- **Complexity:**  
  Setting up and tuning a vector database may require domain-specific knowledge.
- **Cost:**  
  Managed services like Pinecone may incur costs as data scales.
- **Tuning Sensitivity:**  
  Performance may depend on hyperparameters such as the choice of similarity metric and indexing algorithm.
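
The sensitivity to the similarity metric can be seen with a toy example. In the sketch below (vectors chosen by hand for illustration), cosine similarity and Euclidean distance disagree about which candidate is nearest to the query, and they agree again once all vectors are normalized to unit length.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = {
    "A": np.array([10.0, 0.0]),  # same direction as the query, but far away
    "B": np.array([1.0, 1.0]),   # 45 degrees off, but close by
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit(v):
    return v / np.linalg.norm(v)

# Highest cosine similarity wins; smallest Euclidean distance wins.
best_by_cosine = max(candidates, key=lambda n: cosine(query, candidates[n]))
best_by_euclidean = min(
    candidates, key=lambda n: np.linalg.norm(query - candidates[n])
)
# After normalization, Euclidean distance ranks candidates like cosine does.
best_by_euclidean_normalized = min(
    candidates, key=lambda n: np.linalg.norm(unit(query) - unit(candidates[n]))
)

print(best_by_cosine, best_by_euclidean, best_by_euclidean_normalized)
```

If a model produces unnormalized embeddings, the metric configured on the index can therefore change which results come back first.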

---

## 4. Implementation

In the following sections, we will demonstrate how to set up a vector database using Pinecone, store embeddings from a set of sample texts, and retrieve similar embeddings based on a query.
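
The cells below connect to an index that already exists. As a sketch of how such an index might be created first, the helper below wraps the `list_indexes`, `create_index`, and `Index` calls of the Pinecone client. The helper name `ensure_index`, the `PINECONE_API_KEY` environment variable, and the cloud and region values in the usage comment are assumptions; the index name and the 1024 dimensions match the cells in this notebook.

```python
def ensure_index(client, name, dimension, metric, spec):
    """Create the index if it is missing, then return a handle to it.

    `client` is expected to behave like a `pinecone.Pinecone` instance.
    """
    if name not in client.list_indexes().names():
        client.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return client.Index(name)

# Usage sketch (not executed here: it needs a Pinecone account and network access).
# The cloud/region values are placeholders; use the ones available to your account.
#
# import os
# from pinecone import Pinecone, ServerlessSpec
#
# pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# index = ensure_index(
#     pc,
#     "multilingual-e5-large",
#     dimension=1024,
#     metric="cosine",
#     spec=ServerlessSpec(cloud="aws", region="us-east-1"),
# )
```

The `dimension` must match the output size of the embedding model, and a freshly created index may need a short time before it accepts upserts.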


In [2]:
# Cell 1: Install and Import Libraries

!pip install pinecone-client transformers sentence-transformers --quiet

import pinecone
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel

**Explanation for Cell 1:**  
In this cell, we install and import the necessary libraries:
- **pinecone-client:** To interact with the Pinecone vector database.
- **transformers and sentence-transformers:** To generate embeddings from text.
- **NumPy and Matplotlib:** For numerical operations and visualizations.
  
These libraries provide the tools needed for generating, storing, and retrieving embeddings.


In [1]:
pip install pinecone-client


Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone-client
Successfully installed pi

In [7]:
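# Example vector and metadata.
# Assumes `index` already refers to a 1024-dimensional Pinecone index
# (see the cell that calls pinecone_client.Index).
vector_id = "example-id"
vector_values = [0.1] * 1024  # Placeholder values; replace with actual embeddings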
metadata = {"category": "example-category"}

# Upsert vectors into the index
index.upsert(vectors=[{"id": vector_id, "values": vector_values, "metadata": metadata}])


{'upserted_count': 1}
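
The cell above upserts a single record. With many records, a common pattern is to send them in batches so that each request stays small. The sketch below shows only the batching logic; the helper name `chunked`, the batch size, and the 4-dimensional placeholder vectors are assumptions for illustration, and the actual `index.upsert` call is left as a comment because it needs a live index.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder records in the same shape as the upsert call above.
records = [
    {"id": f"vec-{i}", "values": [0.1] * 4, "metadata": {"category": "example-category"}}
    for i in range(5)
]

batches = list(chunked(records, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# With a live index, each batch would be sent separately:
# for batch in batches:
#     index.upsert(vectors=batch)
```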

In [9]:
query_vector = [0.2] * 1024  # Replace with your query embedding
top_k = 5  # Number of closest matches to retrieve

# Query the index for the nearest neighbors of query_vector
results = index.query(
    vector=query_vector,  # The vector to query against the index
    top_k=top_k,          # Number of nearest neighbors to retrieve
    include_metadata=True # Include metadata in results if desired
)

# Display results
print("Query Results:", results)


Query Results: {'matches': [{'id': 'id2', 'score': 0.999999881, 'values': []},
             {'id': 'example-id',
              'metadata': {'category': 'example-category'},
              'score': 0.999999881,
              'values': []},
             {'id': 'id1', 'score': 0.999999881, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}


In [10]:
query_vector = [0.1] * 1024  # Example query vector with 1024 dimensions
top_k = 5
results = index.query(vector=query_vector, top_k=top_k)
print(results)

{'matches': [{'id': 'id2', 'score': 0.999999881, 'values': []},
             {'id': 'example-id', 'score': 0.999999881, 'values': []},
             {'id': 'id1', 'score': 0.999999881, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}
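
Every match in the two outputs above has the same score of roughly 1. Assuming the index uses cosine similarity (the scores suggest this, but the notebook does not show the index configuration), this is expected: a vector whose components are all equal points in the same direction as any other such vector, whatever the constant is. The sketch below checks this for the placeholder vectors used in the cells above.

```python
import numpy as np

stored = np.full(1024, 0.1)  # same values as the upserted placeholder vector
query = np.full(1024, 0.2)   # same values as the first query vector

# Cosine similarity depends only on direction, not on magnitude.
score = float(np.dot(stored, query) / (np.linalg.norm(stored) * np.linalg.norm(query)))
print(score)  # very close to 1.0
```

Constant placeholder vectors are therefore fine for checking that upsert and query calls work, but they cannot show meaningful ranking. Real embeddings are needed for that.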


In [13]:
import os

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

# Initialize Pinecone. Read the API key from the environment instead of
# hardcoding it in the notebook; PINECONE_API_KEY is an assumed variable name.
api_key = os.environ["PINECONE_API_KEY"]
pinecone_client = Pinecone(api_key=api_key)

# Define the index name
index_name = "multilingual-e5-large"

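# Check that the index exists before connecting, and stop early if it does not
# (otherwise `index` would be undefined in the upsert and query calls below)
if index_name not in pinecone_client.list_indexes().names():
    raise ValueError(f"Index '{index_name}' not found. Please verify the index name in your Pinecone dashboard.")

print(f"Successfully connected to index: {index_name}")
index = pinecone_client.Index(index_name)

# Load the E5-large-v2 embedding model (1024-dimensional output).
# Note: despite the index name, this is not intfloat/multilingual-e5-large.
model = SentenceTransformer("intfloat/e5-large-v2")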

# Example data: restaurant descriptions
restaurant_data = [
    {"id": "1", "text": "Italian restaurant with great pasta and wine"},
    {"id": "2", "text": "Cozy cafe offering vegan-friendly desserts and coffee"},
    {"id": "3", "text": "Sushi bar with fresh seafood and authentic Japanese dishes"},
    {"id": "4", "text": "Indian restaurant with spicy curries and naan bread"},
    {"id": "5", "text": "Mexican restaurant with tacos, burritos, and margaritas"},
]

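# Preprocess and encode data.
# E5 models expect a task prefix on every input. This cell uses "query: " for
# both the stored texts and the search query; for retrieval, the E5 model card
# suggests "passage: " for stored documents.
def preprocess_query(query):
    return f"query: {query}"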

vectors = [
    {"id": data["id"], "values": model.encode(preprocess_query(data["text"])).tolist(), "metadata": {"text": data["text"]}}
    for data in restaurant_data
]

# Upsert data into Pinecone
index.upsert(vectors=vectors)

# Querying the Pinecone index
query_text = "I want a place with fresh sushi"
query_vector = model.encode(preprocess_query(query_text)).tolist()

results = index.query(vector=query_vector, top_k=3, include_metadata=True)
print(f"Query: {query_text}\n")
print("Top matches:")
for match in results["matches"]:
    print(f"- {match['metadata']['text']} (score: {match['score']:.4f})")


Successfully connected to index: multilingual-e5-large

Query: I want a place with fresh sushi

Top matches:
- Sushi bar with fresh seafood and authentic Japanese dishes (score: 0.8890)
- Italian restaurant with great pasta and wine (score: 0.7939)
- Indian restaurant with spicy curries and naan bread (score: 0.7864)
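
The cell above prefixes both the stored descriptions and the search query with `query: `. The E5 model card describes two prefixes, `query: ` and `passage: `, and suggests the latter for documents in retrieval settings. The helper below is a small sketch of how the two cases could be kept apart. The function name `format_e5_input` is hypothetical, and switching the stored texts to `passage: ` would change the scores shown above.

```python
def format_e5_input(text, kind):
    """Prefix `text` for an E5 model; `kind` is "query" or "passage"."""
    prefixes = {"query": "query: ", "passage": "passage: "}
    if kind not in prefixes:
        raise ValueError(f"kind must be one of {sorted(prefixes)}, got {kind!r}")
    return prefixes[kind] + text

doc_input = format_e5_input(
    "Sushi bar with fresh seafood and authentic Japanese dishes", "passage"
)
query_input = format_e5_input("I want a place with fresh sushi", "query")
print(doc_input)
print(query_input)

# With the model from the cell above, the strings would be encoded as before:
# doc_vector = model.encode(doc_input).tolist()
# query_vector = model.encode(query_input).tolist()
```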
