
[Chroma DB User Guide](https://docs.trychroma.com/guides)

[Embeddings and Vector Databases With ChromaDB](https://realpython.com/chromadb-vector-database/) 

- Representing unstructured objects with vectors
- Using word and text embeddings in Python
- Harnessing the power of vector databases
- Encoding and querying over documents with ChromaDB
- Providing context to LLMs like ChatGPT with ChromaDB

[Ode to Joy](https://claude.ai/chat/3883912c-bae6-4f82-85de-2f09382f1c90) a fruitful chat session to be followed up

## Vector Basics


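The cells below compute a vector's dimension, magnitude, and dot product with NumPy. The dot product can be written as `np.sum(v1 * v2)`, but a better way is the at-operator (`@`), which performs both vector and matrix multiplication and has cleaner syntax.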

In [6]:
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

# Dimension
v1.shape
# (2,)

(2,)

In [7]:
# Magnitude
np.sqrt(np.sum(v1**2)) ,  np.linalg.norm(v1) ,  np.linalg.norm(v3)
# 1.0,  1.0, 2.0

(np.float64(1.0), np.float64(1.0), np.float64(2.0))

In [8]:
# Dot product
np.sum(v1 * v2)
# 0

np.int64(0)

In [9]:
v1 @ v2, v2 @ v3
# (0, 1.4142135623730951)

(np.int64(0), np.float64(1.4142135623730951))

## Vector Similarity

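Cosine similarity is a normalized form of the dot product: the dot product of two vectors divided by the product of their magnitudes. It ranges from -1 to 1 and depends only on the angle between the vectors, not on their lengths.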
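
To make the normalization concrete, here is a minimal sketch in plain Python (no NumPy, so the arithmetic stays visible). The function name `cosine_similarity` is chosen for illustration only; the vectors mirror `v1`, `v2`, and `v3` from the previous section.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


v1 = [1.0, 0.0]
v2 = [0.0, 1.0]
v3 = [math.sqrt(2), math.sqrt(2)]

print(cosine_similarity(v1, v2))  # orthogonal vectors: 0.0
print(cosine_similarity(v1, v3))  # 45 degrees apart: about 0.7071
# Scaling a vector changes the dot product but not the cosine similarity.
print(cosine_similarity(v1, [10 * x for x in v3]))  # still about 0.7071
```

Because the magnitudes are divided out, two embeddings that point in the same direction score close to 1 regardless of how long they are.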

## Encode Objects in Embeddings

Embeddings are a way to represent data such as words, text, images, and audio in a numerical format that computational algorithms can more easily process.

More specifically, embeddings are dense vectors that characterize meaningful information about the objects that they encode. The most common kinds of embeddings are word and text embeddings.



### Word Embeddings

A word embedding is a vector that captures the semantic meaning of a word. Ideally, words that are semantically similar in natural language should have embeddings that are close to each other in the encoded vector space, while words that are unrelated or opposite should be further apart. In other words, related words cluster together and unrelated words sit far from each other.

```
conda create -n rag python=3.11
conda activate rag
python -m pip install spacy
python -m spacy download en_core_web_md
```

In [10]:
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

dog_embedding = nlp.vocab["dog"].vector
type(dog_embedding),  dog_embedding.shape,  dog_embedding[0:3]
# (numpy.ndarray,  (300,), array([-0.72483 ,  0.42538 ,  0.025489], dtype=float32))

(numpy.ndarray,
 (300,),
 array([-0.72483 ,  0.42538 ,  0.025489], dtype=float32))

In [11]:
def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [12]:
dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector
tasty_embedding = nlp.vocab["tasty"].vector
delicious_embedding = nlp.vocab["delicious"].vector
truck_embedding = nlp.vocab["truck"].vector

In [13]:
compute_cosine_similarity(dog_embedding, cat_embedding)

np.float32(1.0000001)

In [22]:
compute_cosine_similarity(delicious_embedding, tasty_embedding)

np.float32(0.450864)

In [23]:
compute_cosine_similarity(apple_embedding, delicious_embedding)

np.float32(0.39558223)

In [24]:
compute_cosine_similarity(dog_embedding, apple_embedding)

np.float32(0.2334378)

In [25]:
compute_cosine_similarity(truck_embedding, delicious_embedding)

np.float32(0.036047027)

### Text Embeddings

Text embeddings encode information about sentences and documents, not just individual words, into vectors. This allows you to compare larger bodies of text to each other just like you did with word vectors. Because they encode more information than a single word embedding, text embeddings are a more powerful representation of information.

In [1]:
from sentence_transformers import SentenceTransformer


In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In [3]:
# "all-MiniLM-L6-v2" truncates input longer than 256 word pieces (tokens).
# It is one of the smaller pretrained models available, and a good one to start with.
model = SentenceTransformer("all-MiniLM-L6-v2")

# model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")


In [4]:
texts = [
         "The canine barked loudly.",
         "The dog made a noisy bark.",
         "He ate a lot of pizza.",
         "He devoured a large quantity of pizza pie.",
]

text_embeddings = model.encode(texts)

type(text_embeddings),  text_embeddings.shape
# <class 'numpy.ndarray'>  ,  (4, 384)

(numpy.ndarray, (4, 384))

In [14]:
text_embeddings_dict = dict(zip(texts, list(text_embeddings)))

In [16]:
dog_text_1 = "The canine barked loudly."
dog_text_2 = "The dog made a noisy bark."
compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[dog_text_2])

np.float32(0.77686167)

In [17]:
pizza_text_1 = "He ate a lot of pizza."
pizza_text_2 = "He devoured a large quantity of pizza pie."
compute_cosine_similarity(text_embeddings_dict[pizza_text_1],
                          text_embeddings_dict[pizza_text_2])

np.float32(0.78713405)

In [18]:
compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[pizza_text_1])

np.float32(0.09128279)
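
The one-at-a-time comparisons above can be generalized to every pair of texts at once. The sketch below assumes a dictionary shaped like `text_embeddings_dict`; the three-dimensional vectors are made-up stand-ins so that the example runs without downloading a model, and `rank_pairs` is a hypothetical helper name.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Dot product divided by the product of the magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_pairs(embeddings):
    """Return (similarity, text_a, text_b) for every pair, most similar first."""
    scored = [
        (cosine_similarity(embeddings[a], embeddings[b]), a, b)
        for a, b in combinations(embeddings, 2)
    ]
    return sorted(scored, reverse=True)


# Made-up 3-d vectors standing in for real 384-d sentence embeddings.
toy_embeddings = {
    "The canine barked loudly.": [0.9, 0.1, 0.0],
    "The dog made a noisy bark.": [0.8, 0.2, 0.1],
    "He ate a lot of pizza.": [0.0, 0.1, 0.9],
}

for score, a, b in rank_pairs(toy_embeddings):
    print(f"{score:.3f}  {a!r} <-> {b!r}")
```

With real embeddings from `model.encode(texts)`, the same helper should rank the two dog sentences and the two pizza sentences as the closest pairs.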

### Get Started With ChromaDB


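Core components of a vector database:
- Embedding function: converts documents and queries into vectors.
- Similarity metric: cosine similarity, the dot product, or Euclidean distance.
- Indexing: a data structure that speeds up nearest-neighbor search over the stored vectors.
- Metadata: context stored alongside each vector, similar to attributes in a relational database; useful for filtering queries.
- CRUD operations: most vector databases support create, read, update, and delete.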
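
To show how these components fit together, here is a toy in-memory store in plain Python. It is a sketch for intuition only and says nothing about how ChromaDB is implemented: `ToyVectorStore`, `keyword_embedding`, and the tiny vocabulary are all hypothetical, and the "index" is a brute-force scan.

```python
import math

VOCAB = ["pizza", "crust", "food", "symphony", "music"]


def keyword_embedding(text):
    """Hypothetical stand-in for a real model: counts vocabulary words."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_distance(u, v):
    """Similarity metric: 1 - cosine similarity (1.0 if a vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


class ToyVectorStore:
    """In-memory illustration of a vector database's moving parts."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function  # text -> vector
        self.records = {}  # id -> (vector, document, metadata)

    def add(self, doc_id, document, metadata=None):
        # Create / update: embed the document and store it with its metadata.
        vector = self.embedding_function(document)
        self.records[doc_id] = (vector, document, metadata or {})

    def get(self, doc_id):
        # Read: return the stored document and its metadata.
        _, document, metadata = self.records[doc_id]
        return document, metadata

    def delete(self, doc_id):
        del self.records[doc_id]

    def query(self, text, n_results=1, where=None):
        # Brute-force scan; real databases use approximate nearest-neighbor
        # indexes to avoid comparing the query against every stored vector.
        query_vector = self.embedding_function(text)
        scored = []
        for doc_id, (vector, document, metadata) in self.records.items():
            # Metadata filter: skip records whose metadata does not match.
            if where and any(metadata.get(k) != v for k, v in where.items()):
                continue
            scored.append((cosine_distance(query_vector, vector), doc_id, document))
        return sorted(scored)[:n_results]


store = ToyVectorStore(keyword_embedding)
store.add("id0", "Italian pizza is famous for its thin crust.", {"genre": "food"})
store.add("id1", "Beethoven's symphony is celebrated music.", {"genre": "music"})

print(store.query("Find me some pizza food"))
print(store.query("Find me some pizza food", where={"genre": "music"}))
```

The first query returns the pizza document because it is closest to the query vector; the second returns the music document because the metadata filter excludes everything else before distances are compared.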

```
pip install chromadb
pip install sentence_transformers
```


In [1]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

In [None]:
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

In [15]:
# create collection by name
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)

In [17]:
collection, collection.name

(Collection(name=demo_docs), 'demo_docs')

In [18]:
client.list_collections()

[Collection(name=demo_docs)]

In [19]:
client.count_collections()

1

#### Adding data to collection

In [None]:
# add documents
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],
    metadatas=[{"genre": g} for g in genres]       # useful for filtering
)

In [20]:
data = collection.get()

In [21]:
data

{'ids': ['id0', 'id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9'],
 'embeddings': None,
 'documents': ['The latest iPhone model comes with impressive features and a powerful camera.',
  'Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.',
  "Einstein's theory of relativity revolutionized our understanding of space and time.",
  'Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.',
  'The American Revolution had a profound impact on the birth of the United States as a nation.',
  'Regular exercise and a balanced diet are essential for maintaining good physical health.',
  "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
  "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
  'Startup companies often face challenges in securing funding and scaling their operations.',
  "Beethoven's Symphony No. 9 is celebr

#### Querying documents

In [None]:
query_results = collection.query(
    query_texts=["Find me some delicious food!"],
    n_results=1,
)

print(query_results.keys())
# dict_keys including 'ids', 'distances', 'metadatas', 'embeddings', 'documents'
# (the exact set varies by Chroma version; see the output below)

print(query_results["documents"])
# [['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']]

query_results["ids"], query_results["distances"], query_results["metadatas"]
# [['id3']], [[0.7638263782124082]], [[{'genre': 'food'}]]

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'data', 'metadatas', 'distances', 'included'])
[['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']]


([['id3']], [[0.7638265181530919]], [[{'genre': 'food'}]])

In [5]:
query_results = collection.query(
    query_texts=["Teach me about history",
                 "What's going on in the world?"],
    include=["documents", "distances"],
    n_results=2
)

In [6]:
query_results["documents"][0], query_results["documents"][1]

(["Einstein's theory of relativity revolutionized our understanding of space and time.",
  'The American Revolution had a profound impact on the birth of the United States as a nation.'],
 ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
  "Einstein's theory of relativity revolutionized our understanding of space and time."])

In [7]:
query_results["distances"][0], query_results["distances"][1]

([0.6265882513801179, 0.6904193174467044],
 [0.800294374436915, 0.8882107236562582])
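
With `"hnsw:space": "cosine"`, the distances Chroma returns are cosine distances, that is, one minus the cosine similarity, so smaller values mean closer matches. A small sketch that converts the distances from the output above back into similarities (`to_similarity` is a hypothetical helper, not part of Chroma):

```python
def to_similarity(distances):
    """Convert cosine distances (1 - cosine similarity) back to similarities."""
    return [1.0 - d for d in distances]


# Distances copied from the query output above.
history_distances = [0.6265882513801179, 0.6904193174467044]
world_distances = [0.800294374436915, 0.8882107236562582]

print(to_similarity(history_distances))
print(to_similarity(world_distances))
```

Read this way, the best match for "Teach me about history" has a cosine similarity of roughly 0.37, which is a fairly weak match even though it is the closest document in this small collection.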

In [8]:
collection.query(
    query_texts=["Teach me about music history"],
    n_results=1
)

{'ids': [['id2']],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None,
 'metadatas': [[{'genre': 'science'}]],
 'distances': [[0.7625820240341616]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

#### [Using Where filters](https://docs.trychroma.com/guides#using-where-filters)


Chroma supports filtering queries by `metadata` and `document` contents. 

The `where` filter is used to filter by metadata.

Comparison operators:
- `$eq` - equal to (string, int, float)
- `$ne` - not equal to (string, int, float)
- `$gt` - greater than (int, float)
- `$gte` - greater than or equal to (int, float)
- `$lt` - less than (int, float)
- `$lte` - less than or equal to (int, float)

Inclusion operators:
- `$in` - value is in a given list
- `$nin` - value is not in a given list

The `where_document` filter is used to filter by document contents.

Operators:
- `$contains` - the document contains the given string
- `$not_contains` - the document does not contain the given string
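
To illustrate what these operators mean, the following toy evaluator applies `where`-style and `where_document`-style conditions to plain Python values. It is a simplified model for intuition only, not Chroma's implementation: `matches_where` and `matches_where_document` are hypothetical names, the `year` field is invented for the example, and a missing metadata field is assumed never to match.

```python
import operator

# Operators from the lists above, mapped to Python equivalents.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches_where(metadata, where):
    """Return True if every field condition in `where` holds for `metadata`."""
    for field, condition in where.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # short form: {"genre": "music"}
        for op_name, expected in condition.items():
            if not OPERATORS[op_name](metadata[field], expected):
                return False
    return True


def matches_where_document(document, where_document):
    """Substring checks modeled on `$contains` and `$not_contains`."""
    if "$contains" in where_document and where_document["$contains"] not in document:
        return False
    if "$not_contains" in where_document and where_document["$not_contains"] in document:
        return False
    return True


metadata = {"genre": "music", "year": 1824}  # "year" is invented for this example
print(matches_where(metadata, {"genre": "music"}))                        # True
print(matches_where(metadata, {"genre": {"$in": ["music", "history"]}}))  # True
print(matches_where(metadata, {"year": {"$gt": 1900}}))                   # False
print(matches_where_document("Beethoven's Symphony No. 9", {"$contains": "Symphony"}))  # True
```

In a real query, a filter like this narrows the candidate set, and the similarity ranking is applied only to the documents that pass it.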

In [9]:
collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$eq": "music"}},   
    # where={"genre": "music"},  # short form
    n_results=1,
)

{'ids': [['id9']],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None,
 'metadatas': [[{'genre': 'music'}]],
 'distances': [[0.8186328747075663]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [12]:
collection.query(
    query_texts=["Teach me about music history"],
    where_document={"$contains": "Symphony"},
    n_results=1,
)

{'ids': [['id9']],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None,
 'metadatas': [[{'genre': 'music'}]],
 'distances': [[0.8186328747075663]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [13]:
collection.query(
    query_texts=["Teach me about music history"],
    where_document={"$not_contains": "Symphony"},
    n_results=1,
)

{'ids': [['id2']],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None,
 'metadatas': [[{'genre': 'science'}]],
 'distances': [[0.7625820240341616]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [10]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$in": ["music", "history"]}},
    n_results=2,
)

query_results["documents"], query_results["distances"]

([["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
   'The American Revolution had a profound impact on the birth of the United States as a nation.']],
 [[0.8186328747075663, 0.8200413374548509]])