# Vector DB comparison

## [Vector DB Dashboard](https://superlinked.com/vector-db-comparison): click for more information

|       **Feature**      |      **FAISS**      |                        **ChromaDB**                       |      **Pinecone**      |                          **Weaviate**                          |                    **Qdrant**                    |          **LanceDB**          |
|:----------------------:|:-------------------:|:---------------------------------------------------------:|:----------------------:|:--------------------------------------------------------------:|:------------------------------------------------:|:-----------------------------:|
| **Type**               | Library             | Library + DB                                              | Managed Cloud DB       | Managed DB + OSS                                               | Managed DB + OSS                                 | Managed DB + OSS              |
| **Hosting**            | Local               | Local (some cloud support)                                | Cloud-only             | Cloud & Self-host                                              | Cloud & Self-host                                | Cloud & Self-host             |
| **Persistent Storage** | No (in-memory)      | Yes                                                       | Yes                    | Yes                                                            | Yes                                              | Yes (S3)                      |
| **Scalability**        | Manual              | Limited                                                   | Auto-scaling           | Auto-scaling                                                   | Auto-scaling                                     | Auto-scaling                  |
| **API Access**         | No REST API         | Python API                                                | REST/gRPC API          | REST/gRPC API                                                  | REST/gRPC API                                    | Pandas Style API              |
| **Indexing Option**    | IVF, HNSW, PQ, Flat | HNSW, SPANN                                               | Proprietary            | HNSW, Flat, IVF, Flat-BQ                                       | HNSW, IVF, PQ                                    | HNSW, IVF, PQ                 |
| **BM25**               | No                  | No                                                        | No                     | Yes                                                            | No                                               | Yes                           |
| **Hybrid search**      | No                  | No                                                        | Yes                    | Yes                                                            | Yes                                              | Yes                           |
| **Metadata Filtering** | Manual              | Built-in                                                  | Built-in               | Built-in                                                       | Built-in                                         | Built-in (limited)            |
| **Replication**        | No                  | No                                                        | Yes                    | Yes                                                            | Yes                                              | NA                            |
| **Embeddings Storage** | Vectors only        | Vectors + Metadata + Documents                            | Vectors + Metadata     | Vectors + Metadata + Schema                                    | Vectors + Payload (custom metadata)              | Tabular + vector              |
| **Integrations**       | Custom              | LangChain, LlamaIndex                                     | LangChain, LlamaIndex  | LangChain, LlamaIndex                                          | LangChain, LlamaIndex                            | LangChain, LlamaIndex         |
| **License**            | MIT                 | Apache 2.0                                                | Proprietary SaaS       | BSD-3-Clause                                                   | Apache 2.0                                       | Apache 2.0                    |
| **Best For**           | Fast local search   | Embedding + metadata                                      | Production SaaS        | Knowledge graph & vector search                                | Vector search with filtering                     | Tabular data + vector search  |
| **Docker image**       | No                  | [chromadb](https://hub.docker.com/r/chromadb/chroma/tags) | NA                     | [weaviate](https://hub.docker.com/r/semitechnologies/weaviate) | [qdrant](https://hub.docker.com/r/qdrant/qdrant) | NA                            |
| **Dev Language**       | C++                 | Rust                                                      | Rust                   | Go                                                             | Rust                                             | Rust                          |
| **Multi-tenant**       | No                  | Yes                                                       | Yes (via namespace)    | Yes                                                            | Yes (via collection/metadata)                    | No                            |

## Comparing different vector DBs

### Step 1 - Prepare test data

In [1]:
sentences = [
	"Artificial intelligence is transforming modern healthcare through diagnostic tools.",
	"Machine learning algorithms can predict patient outcomes with high accuracy.",
	"Neural networks require large amounts of training data to be effective.",
	"Cloud computing enables scalable AI deployment across industries.",
	"Natural language processing allows computers to understand human speech.",
	"Deep learning models excel at image recognition tasks.",
	"Data privacy remains a major concern in AI implementation.",
	"Quantum computing promises to revolutionize complex calculations.",
	"Robotics automation is changing manufacturing processes worldwide.",
	"Computer vision systems can identify objects in real-time video.",
	"Ethical AI development requires careful consideration of bias.",
	"The Internet of Things connects everyday devices to the cloud.",
	"Blockchain technology provides secure decentralized transactions.",
	"5G networks enable faster data transfer for mobile applications.",
	"Virtual reality creates immersive digital experiences for users.",
	"Cybersecurity threats continue to evolve with advancing technology.",
	"Big data analytics helps businesses make informed decisions.",
	"Autonomous vehicles use sensors and AI to navigate roads safely.",
	"Edge computing processes data closer to the source for reduced latency.",
	"Augmented reality overlays digital information onto the real world.",
	"Python is the most popular programming language for data science.",
	"Reinforcement learning allows AI to learn through trial and error.",
	"Semiconductor chips are essential components in all computing devices.",
	"Digital transformation affects every industry in the modern economy.",
	"AI ethics committees are being formed to guide responsible development.",
]

In [2]:
queries = [
	"AI in healthcare",
	"Machine learning applications",
	"Technology security concerns",
	"Neural networks and data requirements",
	"Real-time computer vision systems",
]

In [3]:
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
EURI_API_KEY = os.getenv("EURI_API_KEY")

In [4]:
import requests
import numpy as np

In [5]:
def generate_embeddings(text: str) -> np.ndarray:
	"""Return the embedding of `text` as a float32 vector."""
	url = "https://api.euron.one/api/v1/euri/embeddings"
	headers = {
		"Content-Type": "application/json",
		"Authorization": f"Bearer {EURI_API_KEY}",
	}
	payload = {"input": text, "model": "text-embedding-3-small"}

	# Fail fast on network stalls and on HTTP errors (e.g. a missing or invalid API key).
	response = requests.post(url, headers=headers, json=payload, timeout=30)
	response.raise_for_status()
	data = response.json()

	embedding = np.array(data["data"][0]["embedding"], dtype=np.float32)

	return embedding

In [6]:
text = "The weather is sunny today."

embedding = generate_embeddings(text)

In [7]:
print(f"embedding shape: {embedding.shape} and embedding type: {embedding.dtype}")

embedding shape: (1536,) and embedding type: float32


In [8]:
# Embed every test sentence (one API call per sentence).
embeddings = []
for sentence in sentences:
	emb = generate_embeddings(text=sentence)
	embeddings.append(emb)
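
The loop above makes one HTTP request per sentence, so re-running the cell repeats every call. A minimal sketch of a caching wrapper, assuming any callable with the same shape as `generate_embeddings` (text in, vector out). The names `embed_all` and `fake_embed` are hypothetical and exist only for this illustration:

```python
from typing import Callable, Dict, List, Sequence


def embed_all(
    texts: Sequence[str],
    embed_fn: Callable[[str], List[float]],
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Embed each text once, reusing cached vectors for repeated inputs."""
    vectors = []
    for text in texts:
        if text not in cache:
            # Only call the (possibly slow, billed) embedding function on a cache miss.
            cache[text] = embed_fn(text)
        vectors.append(cache[text])
    return vectors


# Stand-in embedder so the sketch runs offline; it is NOT a real embedding model.
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache: Dict[str, List[float]] = {}
vectors = embed_all(
    ["AI in healthcare", "AI in healthcare", "Edge computing"], fake_embed, cache
)
print(len(vectors), len(calls))
```

In the notebook, `embed_fn` would be `generate_embeddings`, and the returned list could be passed to `np.vstack` as before.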

In [9]:
embeddings

[array([ 0.0098142 , -0.02956349,  0.0170677 , ...,  0.00462685,
        -0.03077241,  0.02712368], shape=(1536,), dtype=float32),
 array([-0.00840364, -0.00392833,  0.03710642, ..., -0.00084534,
        -0.01359169, -0.00227219], shape=(1536,), dtype=float32),
 array([ 0.00685719,  0.01064172,  0.03272605, ..., -0.02078094,
         0.00418498,  0.01802759], shape=(1536,), dtype=float32),
 array([-0.00989157, -0.03057791,  0.04363609, ..., -0.01714974,
        -0.01270996,  0.04276554], shape=(1536,), dtype=float32),
 array([-0.03052658,  0.0204996 , -0.0008052 , ..., -0.00183322,
         0.02163397,  0.03516533], shape=(1536,), dtype=float32),
 array([ 0.01001777, -0.03060824, -0.0151873 , ...,  0.01738751,
         0.01140021,  0.0041327 ], shape=(1536,), dtype=float32),
 array([ 0.0315668 ,  0.00453559,  0.03619356, ..., -0.01535035,
         0.03108817,  0.00243731], shape=(1536,), dtype=float32),
 array([-0.0273714 ,  0.01015717, -0.01994542, ..., -0.02634794,
         0.0294897

In [10]:
len(embeddings[0])

1536

In [11]:
embeddings_array = np.vstack(embeddings)
embeddings_array.shape

(25, 1536)

In [12]:
dimension = embeddings_array.shape[1]
dimension

1536

### Running in FAISS

In [13]:
import faiss

In [14]:
index = faiss.IndexFlatL2(dimension)
index

<faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x114642640> >

In [15]:
index.add(embeddings_array)
index.ntotal

25

In [16]:
query = queries[0]
query

'AI in healthcare'

In [17]:
query_vec = generate_embeddings(text=query).reshape(1, -1)
query_vec.shape

(1, 1536)

In [18]:
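# Fetch the 2 nearest neighbours: `distance` holds squared L2 distances, `indices` holds row positions.
distance, indices = index.search(query_vec, 2)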

In [19]:
print(
	f"Query: {query}\n"
	f"Top 2 most similar sentences:\n"
	f"{sentences[indices[0][0]]}\n"
	f"{sentences[indices[0][1]]}\n"
)

Query: AI in healthcare
Top 2 most similar sentences:
Artificial intelligence is transforming modern healthcare through diagnostic tools.
Cloud computing enables scalable AI deployment across industries.
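
`IndexFlatL2` is an exact (brute-force) index: it compares the query against every stored vector and reports squared Euclidean distances. A minimal pure-Python sketch of that idea on toy 2-D vectors; `flat_l2_search` is a hypothetical helper written for illustration, not part of FAISS:

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[float], List[int]]:
    """Return the k smallest squared L2 distances and their row indices."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # ascending distance; ties are broken by the lower row index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

This costs one distance computation per stored vector, which is fine for 25 sentences; the approximate indexes in the table (IVF, HNSW, PQ) trade some accuracy for speed on larger collections.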



In [20]:
# save index to disk (the ../data directory must already exist)
faiss.write_index(index, "../data/faiss_index.faiss")
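
FAISS stores only vectors, so any metadata lives in your own structures keyed by row position, and filtering happens after the search. A minimal sketch of that post-filtering step, assuming you over-fetch candidates from `index.search` and keep a parallel metadata list. The `topic` field and the `filter_hits` helper are hypothetical:

```python
from typing import Callable, Dict, List, Sequence, Tuple


def filter_hits(
    indices: Sequence[int],
    distances: Sequence[float],
    metadata: Sequence[Dict[str, str]],
    predicate: Callable[[Dict[str, str]], bool],
    k: int,
) -> List[Tuple[int, float]]:
    """Keep the first k hits (already sorted by distance) whose metadata passes."""
    kept: List[Tuple[int, float]] = []
    for idx, dist in zip(indices, distances):
        if idx < 0:
            # FAISS pads with -1 when fewer results exist than were requested.
            continue
        if predicate(metadata[idx]):
            kept.append((idx, dist))
        if len(kept) == k:
            break
    return kept


# Toy data standing in for one row of `indices` / `distance` from index.search.
metadata = [{"topic": "health"}, {"topic": "cloud"}, {"topic": "health"}]
hits = filter_hits(
    indices=[1, 0, 2, -1],
    distances=[0.2, 0.5, 0.9, 3.4e38],
    metadata=metadata,
    predicate=lambda m: m["topic"] == "health",
    k=2,
)
print(hits)  # [(0, 0.5), (2, 0.9)]
```

Over-fetching is a design choice: if the filter is selective, request more than `k` candidates from the index so that enough survive the filter.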

### Running in ChromaDB

In [13]:
import chromadb

In [14]:
from chromadb.utils import embedding_functions

In [16]:
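# In-memory (ephemeral) client: data is lost when the process exits.
# Use chromadb.PersistentClient(path=...) to keep the collection on disk.
chroma_client = chromadb.Client()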

In [17]:
chroma_collection = chroma_client.create_collection(name="chroma_db_test")

In [18]:
len(embeddings_array[0])

1536

In [20]:
chroma_collection.add(
    documents=sentences,
    embeddings=embeddings_array,
    ids=[f"rec_{i}" for i in range(len(sentences))]
)

In [21]:
chroma_collection.count()

25

In [22]:
chroma_collection.get()

{'ids': ['rec_0',
  'rec_1',
  'rec_2',
  'rec_3',
  'rec_4',
  'rec_5',
  'rec_6',
  'rec_7',
  'rec_8',
  'rec_9',
  'rec_10',
  'rec_11',
  'rec_12',
  'rec_13',
  'rec_14',
  'rec_15',
  'rec_16',
  'rec_17',
  'rec_18',
  'rec_19',
  'rec_20',
  'rec_21',
  'rec_22',
  'rec_23',
  'rec_24'],
 'embeddings': None,
 'documents': ['Artificial intelligence is transforming modern healthcare through diagnostic tools.',
  'Machine learning algorithms can predict patient outcomes with high accuracy.',
  'Neural networks require large amounts of training data to be effective.',
  'Cloud computing enables scalable AI deployment across industries.',
  'Natural language processing allows computers to understand human speech.',
  'Deep learning models excel at image recognition tasks.',
  'Data privacy remains a major concern in AI implementation.',
  'Quantum computing promises to revolutionize complex calculations.',
  'Robotics automation is changing manufacturing processes worldwide.',
  'C

In [24]:
query

'AI in healthcare'

In [27]:
query_vec = generate_embeddings(text=query).reshape(1, -1)

In [28]:
result = chroma_collection.query(query_embeddings=query_vec, n_results=2)

In [29]:
print(result)

{'ids': [['rec_0', 'rec_3']], 'embeddings': None, 'documents': [['Artificial intelligence is transforming modern healthcare through diagnostic tools.', 'Cloud computing enables scalable AI deployment across industries.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[0.7103437781333923, 0.9832593202590942]]}


In [34]:
def print_chroma_results_detailed(result, query_text=None):
    """
    Detailed formatting with metadata support.
    """
    if query_text:
        print(f"🔍 QUERY: '{query_text}'")
        print("=" * 80)
    
    documents = result['documents'][0]
    distances = result['distances'][0]
    ids = result['ids'][0]
    metadatas = result['metadatas'][0] if result['metadatas'] else [None] * len(documents)
    
    print(f"📊 Found {len(documents)} results (lower distance = more similar)\n")
    
    for i, (doc_id, document, distance, metadata) in enumerate(zip(ids, documents, distances, metadatas)):
        print(f"🏆 RANK {i+1} | Distance: {distance:.4f} | ID: {doc_id}")
        print(f"📄 Document: {document}")
        if metadata:
            print(f"📋 Metadata: {metadata}")
        print("-" * 80)

In [35]:
print_chroma_results_detailed(result, query_text=query)

🔍 QUERY: 'AI in healthcare'
📊 Found 2 results (lower distance = more similar)

🏆 RANK 1 | Distance: 0.7103 | ID: rec_0
📄 Document: Artificial intelligence is transforming modern healthcare through diagnostic tools.
--------------------------------------------------------------------------------
🏆 RANK 2 | Distance: 0.9833 | ID: rec_3
📄 Document: Cloud computing enables scalable AI deployment across industries.
--------------------------------------------------------------------------------
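
Chroma's default distance is squared L2, which is why lower means more similar and why the ranking matches the FAISS `IndexFlatL2` result. If the embeddings are unit-length (an assumption here: OpenAI documents its embeddings as normalized, but this notebook calls a third-party endpoint), the squared L2 distance `d2` and the cosine similarity are related by `d2 = 2 - 2 * cos`. A minimal sketch of the conversion, applied to the two distances printed above; `l2_squared_to_cosine` is a hypothetical helper:

```python
def l2_squared_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


# Distances copied from the query result above.
for rec_id, dist in [("rec_0", 0.7103437781333923), ("rec_3", 0.9832593202590942)]:
    print(rec_id, round(l2_squared_to_cosine(dist), 4))
```

Under that assumption, `rec_0` has a cosine similarity of about 0.64 to the query and `rec_3` about 0.51.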
