# Vector DB comparison


## [Vector DB Dashboard](https://superlinked.com/vector-db-comparison) (click for more information)


|      **Feature**       |      **FAISS**      |                       **ChromaDB**                        |      **Pinecone**      |                          **Weaviate**                          |                    **Qdrant**                    |         **LanceDB**          |
| :--------------------: | :-----------------: | :-------------------------------------------------------: | :--------------------: | :------------------------------------------------------------: | :----------------------------------------------: | :--------------------------: |
|        **Type**        |       Library       |                       Library + DB                        |    Managed Cloud DB    |                        Managed DB + OSS                        |                 Managed DB + OSS                 |       Managed DB + OSS       |
|      **Hosting**       |        Local        |                Local (some cloud support)                 |       Cloud-only       |                       Cloud & Self-host                        |                Cloud & Self-host                 |      Cloud & Self-host       |
| **Persistent Storage** | In-memory (manual save/load) |                        Yes                            |          Yes           |                              Yes                               |                       Yes                        |           Yes (S3)           |
|    **Scalability**     |       Manual        |                          Limited                          |      Auto-scaling      |                          Auto-scaling                          |                   Auto-scaling                   |         Auto-scaling         |
|     **API Access**     |     No REST API     |                        Python API                         |     REST/gRPC API      |                         REST/gRPC API                          |                  REST/gRPC API                   |       Pandas Style API       |
|  **Indexing Options**  | IVF, HNSW, PQ, Flat |                        HNSW, SPANN                        |      Proprietary       |                    HNSW, Flat, IVF, Flat-BQ                    |                  HNSW, IVF, PQ                   |        HNSW, IVF, PQ         |
|        **BM25**        |         No          |                            No                             |           No           |                              Yes                               |                        No                        |             Yes              |
|   **Hybrid search**    |         No          |                            No                             |          Yes           |                              Yes                               |                       Yes                        |             Yes              |
| **Metadata Filtering** |       Manual        |                         Built-in                          |        Built-in        |                            Built-in                            |                     Built-in                     |      Built-in (limited)      |
|    **Replication**     |         No          |                            No                             |          Yes           |                              Yes                               |                       Yes                        |              NA              |
| **Embeddings Storage** |    Vectors only     |              Vectors + Metadata + Documents               |   Vectors + Metadata   |                  Vectors + Metadata + Schema                   |       Vectors + Payload (Custom metadata)        |       tabular + vector       |
|    **Integrations**    |       Custom        |                   LangChain, LlamaIndex                   | LangChain, LlamaIndex  |                     LangChain, LlamaIndex                      |              LangChain, LlamaIndex               |    LangChain, LlamaIndex     |
|      **License**       |         MIT         |                        Apache 2.0                         |    Proprietary SaaS    |                          BSD-3-Clause                          |                    Apache 2.0                    |          Apache 2.0          |
|      **Best For**      |  Fast local search  |                   Embedding + metadata                    |    Production SaaS     |                Knowledge graph & vector search                 |           Vector search with filtering           | Tabular data + vector search |
|    **Docker image**    |         No          | [chromadb](https://hub.docker.com/r/chromadb/chroma/tags) |           NA           | [weaviate](https://hub.docker.com/r/semitechnologies/weaviate) | [qdrant](https://hub.docker.com/r/qdrant/qdrant) |              NA              |
|    **Dev Language**    |         C++         |                           Rust                            |          Rust          |                               Go                               |                       Rust                       |             Rust             |
|    **Multi-tenant**    |         No          |                            Yes                            |  Yes (via namespace)   |                              Yes                               |          Yes (via collection/metadata)           |              No              |


## Comparing different vector DBs


### Step 1 - Prepare test data


In [None]:
from __future__ import annotations

sentences = [
	"Artificial intelligence is transforming modern healthcare through diagnostic "
	"tools.",
	"Machine learning algorithms can predict patient outcomes with high accuracy.",
	"Neural networks require large amounts of training data to be effective.",
	"Cloud computing enables scalable AI deployment across industries.",
	"Natural language processing allows computers to understand human speech.",
	"Deep learning models excel at image recognition tasks.",
	"Data privacy remains a major concern in AI implementation.",
	"Quantum computing promises to revolutionize complex calculations.",
	"Robotics automation is changing manufacturing processes worldwide.",
	"Computer vision systems can identify objects in real-time video.",
	"Ethical AI development requires careful consideration of bias.",
	"The Internet of Things connects everyday devices to the cloud.",
	"Blockchain technology provides secure decentralized transactions.",
	"5G networks enable faster data transfer for mobile applications.",
	"Virtual reality creates immersive digital experiences for users.",
	"Cybersecurity threats continue to evolve with advancing technology.",
	"Big data analytics helps businesses make informed decisions.",
	"Autonomous vehicles use sensors and AI to navigate roads safely.",
	"Edge computing processes data closer to the source for reduced latency.",
	"Augmented reality overlays digital information onto the real world.",
	"Python is the most popular programming language for data science.",
	"Reinforcement learning allows AI to learn through trial and error.",
	"Semiconductor chips are essential components in all computing devices.",
	"Digital transformation affects every industry in the modern economy.",
	"AI ethics committees are being formed to guide responsible development.",
]

In [None]:
queries = [
	"AI in healthcare",
	"Machine learning applications",
	"Technology security concerns",
	"Neural networks and data requirements",
	"Real-time computer vision systems",
]

In [None]:
import os
from datetime import UTC, datetime

from dotenv import find_dotenv, load_dotenv

# load environment variables from a .env file (if present)
load_dotenv(find_dotenv())

EURI_API_KEY = os.getenv("EURI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
WEAVIATE_API_KEY = os.getenv("WEAVIATE_API_KEY")
WEAVIATE_REST_END_POINT = os.getenv("WEAVIATE_REST_ENDPOINT")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_URL = os.getenv("QDRANT_URL")

In [None]:
import numpy as np
import requests

In [None]:
def generate_embeddings(text: str) -> np.ndarray:
	"""Return the embedding of `text` as a 1-D float32 array."""
	url = "https://api.euron.one/api/v1/euri/embeddings"
	headers = {
		"Content-Type": "application/json",
		"Authorization": f"Bearer {EURI_API_KEY}",
	}
	payload = {"input": text, "model": "text-embedding-3-small"}

	response = requests.post(url, headers=headers, json=payload, timeout=30)
	# surface HTTP errors (bad key, rate limit) here rather than as a KeyError below
	response.raise_for_status()
	data = response.json()

	return np.array(data["data"][0]["embedding"], dtype=np.float32)

In [None]:
text = "The weather is sunny today."

embedding = generate_embeddings(text)

In [None]:
print(f"embedding shape: {embedding.shape} and embedding type: {embedding.dtype}")

In [None]:
# embed every sentence (one API call per sentence)
embeddings = [generate_embeddings(text=sentence) for sentence in sentences]
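
Each call to `generate_embeddings` is a network round trip, and the sections that follow embed the same query text again for every database. A minimal sketch of a cache that avoids the repeated calls is shown here. The names `make_cached_embedder` and `fake_embed` are hypothetical and not part of the notebook; `fake_embed` only stands in for `generate_embeddings` so the example runs offline.

```python
from collections.abc import Callable, Sequence


def make_cached_embedder(
    embed_fn: Callable[[str], Sequence[float]],
) -> Callable[[str], Sequence[float]]:
    """Wrap an embedding function so each distinct text is embedded only once."""
    cache: dict[str, Sequence[float]] = {}

    def cached(text: str) -> Sequence[float]:
        if text not in cache:
            cache[text] = embed_fn(text)  # only reached on a cache miss
        return cache[text]

    return cached


# Hypothetical stand-in for `generate_embeddings`; it records every call it receives.
calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text)), 0.0]


cached_embed = make_cached_embedder(fake_embed)
first = cached_embed("AI in healthcare")
second = cached_embed("AI in healthcare")
print(f"underlying calls: {len(calls)}")
```

In the notebook this would be used as `cached_embed = make_cached_embedder(generate_embeddings)`. The cache returns the same object on every hit, so callers that modify the result in place should copy it first.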

In [None]:
embeddings

In [None]:
len(embeddings[0])

In [None]:
embeddings_array = np.vstack(embeddings)
embeddings_array.shape

In [None]:
dimension = embeddings_array.shape[1]
dimension

### Running in FAISS


In [None]:
import faiss

In [None]:
index = faiss.IndexFlatL2(dimension)
index

In [None]:
index.add(embeddings_array)
index.ntotal

In [None]:
query = queries[0]
query

In [None]:
query_vec = generate_embeddings(text=query).reshape(1, -1)
query_vec.shape

In [None]:
distance, indices = index.search(query_vec, 2)

In [None]:
print(
	f"Query: {query}\n"
	f"Top 2 most similar sentences:\n"
	f"{sentences[indices[0][0]]}\n"
	f"{sentences[indices[0][1]]}\n"
)
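
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, so a smaller value means a closer match. The following pure-Python sketch reproduces that computation on toy vectors to make the returned `(distances, indices)` pair easier to read. The function name `flat_l2_search` and the toy data are illustrative only; FAISS itself is far more optimized.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors closest to `query`."""
    scored = []
    for idx, vec in enumerate(vectors):
        # squared Euclidean distance, with no square root taken
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # ascending: the closest vectors come first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
print(toy_indices)  # [1, 0]
```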

In [None]:
# save index to disk
faiss.write_index(index, "../data/faiss_index.faiss")

## Chroma DB


In [None]:
import chromadb

In [None]:
import chromadb.config

In [None]:
chroma_client = chromadb.PersistentClient(path="../data/chroma_db")

In [None]:
collection_list = chroma_client.list_collections()
if len(collection_list) > 0:
	print(collection_list)

In [None]:
# start fresh; this raises if the collection does not exist yet (skip on a first run)
chroma_client.delete_collection(name="chroma_db_test")

In [None]:
chroma_collection = chroma_client.create_collection(
	name="chroma_db_test",
)

In [None]:
len(embeddings_array[0])

In [None]:
chroma_collection.add(
	documents=sentences,
	embeddings=embeddings_array,
	ids=[f"rec_{i}" for i in range(len(sentences))],
)

In [None]:
chroma_collection.count()

In [None]:
chroma_collection.get()

In [None]:
query

In [None]:
query_vec = generate_embeddings(text=query).reshape(1, -1)

In [None]:
result = chroma_collection.query(query_embeddings=query_vec, n_results=2)

In [None]:
print(result)

In [None]:
def print_chroma_results_detailed(result, query_text=None):
	"""
	Detailed formatting with metadata support.
	"""
	if query_text:
		print(f"🔍 QUERY: '{query_text}'")
		print("=" * 80)

	documents = result["documents"][0]
	distances = result["distances"][0]
	ids = result["ids"][0]
	metadatas = (
		result["metadatas"][0] if result["metadatas"] else [None] * len(documents)
	)

	print(f"📊 Found {len(documents)} results (lower distance = more similar)\n")

	for i, (doc_id, document, distance, metadata) in enumerate(
		zip(ids, documents, distances, metadatas, strict=False)
	):
		print(f"🏆 RANK {i + 1} | Distance: {distance:.4f} | ID: {doc_id}")
		print(f"📄 Document: {document}")
		if metadata:
			print(f"📋 Metadata: {metadata}")
		print("-" * 80)

In [None]:
print_chroma_results_detailed(result, query_text=query)

## Pinecone DB


In [None]:
from pinecone import Pinecone

In [None]:
pc = Pinecone(api_key=PINECONE_API_KEY)

In [None]:
# the index was created manually in the Pinecone console with dimension 1536
index = pc.Index(name="compare-vector-db")

In [None]:
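# single-record smoke test; it stores sentences[0] under id "0", separate from the "id_<n>" records
index.upsert(vectors=[("0", embeddings_array[0].tolist(), {"text": sentences[0]})])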

### Creating a Pinecone-compatible record structure with custom metadata


In [None]:
records = []
for i, (text, emb) in enumerate(zip(sentences, embeddings_array, strict=False)):
	records.append(
		(
			f"id_{i}",
			emb.tolist(),
			{
				"text": text,
				"inserted_on": str(datetime.now(UTC)),
				"inserted_by": "batch_python",
			},
		)
	)

In [None]:
records[0]

In [None]:
index.upsert(vectors=records)
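
With 25 records a single upsert is fine. For larger datasets, upserts are usually sent in batches. The sketch below shows one way to split a record list; the helper name `chunked`, the toy records, and the batch size of 10 are illustrative assumptions rather than Pinecone requirements.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Toy records in the same (id, vector, metadata) shape used above.
toy_records = [(f"id_{n}", [0.0, 0.0], {"text": f"sentence {n}"}) for n in range(25)]
batch_sizes = [len(batch) for batch in chunked(toy_records, 10)]
print(batch_sizes)  # [10, 10, 5]
```

Against the live index this would read `for batch in chunked(records, 100): index.upsert(vectors=batch)`.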

In [None]:
index.describe_index_stats()

In [None]:
index.fetch(ids=["id_17"])

In [None]:
query

In [None]:
query = queries[0]
query

### Note: the query vector must be converted to a list before it can be used with Pinecone


In [None]:
query_vec = generate_embeddings(text=query).reshape(1, -1).tolist()

In [None]:
query_vec

In [None]:
# query_vec is nested ([[...]]); pass the inner flat list of floats
result = index.query(vector=query_vec[0], top_k=2, include_metadata=True)

In [None]:
def analyze_pinecone_results(result, query_text=None, show_metadata=True):
	"""
	Pretty-print Pinecone query matches, optionally with their metadata.
	"""
	if query_text:
		print(f"🔍 QUERY: '{query_text}'")
		print("=" * 80)

	matches = result.get("matches", [])
	namespace = result.get("namespace", "default")

	if not matches:
		print("❌ No matches found.")
		return

	print(f"📊 Found {len(matches)} results in namespace '{namespace}'")
	print("💡 Higher score = more similar (cosine similarity)\n")

	for i, match in enumerate(matches):
		score = match.get("score", 0)
		doc_id = match.get("id", "N/A")
		metadata = match.get("metadata", {})
		text = metadata.get("text", "No text available")

		# Calculate similarity percentage
		similarity_pct = score * 100

		# Visual score indicator (clamped so out-of-range scores still render 20 cells)
		filled = max(0, min(20, int(score * 20)))
		score_bar = "█" * filled + "░" * (20 - filled)

		print(f"🏆 RANK {i + 1}")
		print(f"   📍 ID: {doc_id}")
		print(f"   📈 Score: {score:.4f} ({similarity_pct:.1f}%)")
		print(f"   📊 {score_bar}")
		print(f"   📄 Text: {text}")

		if show_metadata and metadata:
			# Exclude text from metadata display since we already show it
			other_metadata = {k: v for k, v in metadata.items() if k != "text"}
			if other_metadata:
				print("   📋 Metadata:")
				for key, value in other_metadata.items():
					print(f"      • {key}: {value}")

		print("-" * 80)

In [None]:
analyze_pinecone_results(result, query, show_metadata=True)
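
The FAISS and Chroma sections report distances, where lower is better, while this output reports a similarity score, where higher is better. For unit-length vectors the two are linked by `squared_l2 = 2 - 2 * cosine`, so both orderings agree. The sketch below checks that identity on toy vectors. It assumes the Chroma collection uses its default L2 space and that the Pinecone index was created with the cosine metric. Whether the embeddings returned by the endpoint are already unit-length is not verified here, which is why the sketch normalizes explicitly.

```python
import math


def normalize(vec):
    """Scale `vec` to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


unit_a = normalize([3.0, 4.0])  # [0.6, 0.8]
unit_b = normalize([1.0, 0.0])  # [1.0, 0.0]
cos_ab = cosine_similarity(unit_a, unit_b)
l2_ab = squared_l2(unit_a, unit_b)
print(f"cosine={cos_ab:.4f} squared_l2={l2_ab:.4f} 2-2cos={2 - 2 * cos_ab:.4f}")
```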

## Weaviate DB


In [None]:
import weaviate
from weaviate.classes.init import Auth

In [None]:
client = weaviate.connect_to_weaviate_cloud(
	cluster_url=WEAVIATE_REST_END_POINT,
	auth_credentials=Auth.api_key(WEAVIATE_API_KEY),
)

In [None]:
print(client.is_ready())

In [None]:
import weaviate.classes as wvc

In [None]:
weaviate_colleciton = client.collections.create(
	name="vectordb_compare",
	vector_config=wvc.config.Configure.Vectors.self_provided(),
)

### Creating the record structure expected by Weaviate (default metadata is also created on insert)


In [None]:
records = []
for text, emb in zip(sentences, embeddings_array, strict=False):
	records.append(
		wvc.data.DataObject(
			properties={
				"text": text,
				"inserted_on": str(datetime.now(UTC)),
				"inserted_by": "batch_python",
			},
			vector=emb.tolist(),
		)
	)

In [None]:
weaviate_colleciton.data.insert_many(records)

In [None]:
query = queries[0]
query

In [None]:
query_vec = generate_embeddings(text=query).reshape(1, -1).tolist()

In [None]:
query_vec[0]

In [None]:
response = weaviate_colleciton.query.near_vector(
	near_vector=query_vec[0],
	limit=2,
	return_metadata=wvc.query.MetadataQuery(certainty=True, distance=True),
)

In [None]:
for o in response.objects:
	print(o.properties)
	print(o.metadata.distance)
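
The query requests both `certainty` and `distance`, but only the distance is printed. Assuming the collection uses Weaviate's default cosine metric, the two values are linked: distance is `1 - cosine_similarity` and certainty is `1 - distance / 2`. A small sketch of the conversion follows; the function names are illustrative and not part of the Weaviate client.

```python
def cosine_similarity_from_distance(distance: float) -> float:
    """Cosine distance runs from 0 (identical) to 2 (opposite)."""
    return 1.0 - distance


def certainty_from_cosine_distance(distance: float) -> float:
    """Map cosine distance in [0, 2] to a certainty in [0, 1]."""
    return 1.0 - distance / 2.0


for toy_distance in (0.0, 1.0, 2.0):
    similarity = cosine_similarity_from_distance(toy_distance)
    certainty = certainty_from_cosine_distance(toy_distance)
    print(f"distance={toy_distance} similarity={similarity} certainty={certainty}")
```
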

## Qdrant DB


In [None]:
from qdrant_client import QdrantClient

In [None]:
qdrant_client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

In [None]:
qdrant_collection = "vector-compare"
if qdrant_client.collection_exists(collection_name=qdrant_collection):
	qdrant_client.delete_collection(collection_name=qdrant_collection)

In [None]:
from qdrant_client.models import Distance, VectorParams

qdrant_client.create_collection(
	collection_name=qdrant_collection,
	vectors_config=VectorParams(size=dimension, distance=Distance.COSINE),
)

In [None]:
from qdrant_client.models import PointStruct

qdrant_client.upsert(
	collection_name=qdrant_collection,
	points=[
		PointStruct(
			id=i,
			vector=emb.tolist(),
			payload={
				"text": text,
				"inserted_on": str(datetime.now(UTC)),
				"inserted_by": "batch_python",
			},
		)
		for i, (text, emb) in enumerate(zip(sentences, embeddings_array, strict=False))
	],
)

In [None]:
query = queries[0]
query

In [None]:
query_vec = generate_embeddings(text=query).reshape(1, -1).tolist()

In [None]:
query_vec[0]

In [None]:
response = qdrant_client.query_points(
	collection_name=qdrant_collection,
	query=query_vec[0],
	limit=2,  # Return 2 closest points
	with_payload=True,
)
response

In [None]:
def format_qdrant_response(response):
	"""
	Parse and format a Qdrant query response for display.

	Args:
		response: QueryResponse object returned by `query_points`.

	Returns:
		None (prints formatted output).
	"""
	print(f"Found {len(response.points)} results:\n")

	for i, point in enumerate(response.points, 1):
		print(f"Result {i}:")
		print(f"  ID: {point.id}")
		print(f"  Score: {point.score:.4f}")
		print(f"  Text: {point.payload['text']}")
		print(f"  Inserted on: {point.payload['inserted_on']}")
		print(f"  Inserted by: {point.payload['inserted_by']}")
		print("-" * 50)

In [None]:
format_qdrant_response(response)
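
Each database in this notebook labels the same sentences differently: FAISS and Qdrant use the integer position, Chroma uses `rec_<n>`, and Pinecone uses `id_<n>`. To check whether the top results agree, the IDs first have to be mapped back to a sentence index. The sketch below does that. The helper names and the `toy_results` values are illustrative placeholders, not output from the runs above. Weaviate is left out because it assigns UUIDs, so its results would be matched on the `text` property instead.

```python
def sentence_index(record_id) -> int:
    """Map 'rec_3' (Chroma), 'id_3' (Pinecone) or 3 (FAISS, Qdrant) to 3."""
    if isinstance(record_id, int):
        return record_id
    # take the part after the last underscore; a bare "3" also works
    return int(str(record_id).rsplit("_", 1)[-1])


def rankings_agree(results_by_db: dict) -> bool:
    """Return True when every database returned the same sentences in the same order."""
    rankings = [
        [sentence_index(record_id) for record_id in ids]
        for ids in results_by_db.values()
    ]
    return all(ranking == rankings[0] for ranking in rankings)


# Placeholder IDs for illustration only.
toy_results = {
    "faiss": [0, 1],
    "chroma": ["rec_0", "rec_1"],
    "pinecone": ["id_0", "id_1"],
    "qdrant": [0, 1],
}
print(rankings_agree(toy_results))  # True
```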