[Embeddings and Vector Databases With ChromaDB](https://realpython.com/chromadb-vector-database/) 

- Representing unstructured objects with vectors
- Using word and text embeddings in Python
- Harnessing the power of vector databases
- Encoding and querying over documents with ChromaDB
- Providing context to LLMs like ChatGPT with ChromaDB

[Ode to Joy](https://claude.ai/chat/3883912c-bae6-4f82-85de-2f09382f1c90) a fruitful chat session to be followed up

## Vector Basics


The dot product of two vectors can be computed as `np.sum(v1 * v2)`, but the at-operator (`@`) is cleaner and handles both vector and matrix multiplication.

In [2]:
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

# Dimension
v1.shape
# (2,)

(2,)

In [3]:
# Magnitude
np.sqrt(np.sum(v1**2)), np.linalg.norm(v1), np.linalg.norm(v3)
# (1.0, 1.0, 2.0)

(1.0, 1.0, 2.0)

In [4]:
# Dot product
np.sum(v1 * v2)
# 0

0

In [5]:
v1 @ v2, v2 @ v3
# (0, 1.4142135623730951)

(0, 1.4142135623730951)
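
Since `@` also covers matrix multiplication, here is a small sketch of the same operator applied to a matrix. The matrix `R` and the names `rotated`, `stacked`, and `pairwise` are illustrative assumptions, not part of the original example.

```python
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])

# Hypothetical 2x2 matrix: rotates a vector 90 degrees counterclockwise
R = np.array([[0, -1],
              [1,  0]])

# Matrix-vector product: rotating v1 lands on v2
rotated = R @ v1

# Stack the vectors as rows, then compute every pairwise dot product at once
stacked = np.stack([v1, v2])      # shape (2, 2)
pairwise = stacked @ stacked.T    # entry [i, j] is the dot product of row i and row j
```

The diagonal of `pairwise` holds each vector's dot product with itself (its squared magnitude), and the off-diagonal entries hold the dot products between different vectors.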

## Vector Similarity

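Cosine similarity is a normalized form of the dot product: the dot product of two vectors divided by the product of their magnitudes. The result lies between -1 and 1 and depends only on the angle between the vectors, not on their lengths.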
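
A minimal sketch of that normalization, reusing the 2D vectors from the Vector Basics cells. The helper name `cosine_similarity` and the variable names are chosen here for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of u and v divided by the product of their magnitudes."""
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

sim_orthogonal = cosine_similarity(v1, v2)   # perpendicular vectors: 0
sim_diagonal = cosine_similarity(v1, v3)     # 45 degrees apart: about 0.707
sim_scaled = cosine_similarity(v3, 10 * v3)  # same direction, different length: about 1
```

Scaling a vector changes its dot product with other vectors but not its cosine similarity, which is why cosine similarity is convenient for comparing embeddings of different magnitudes.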

## Encode Objects in Embeddings

Embeddings are a way to represent data such as words, text, images, and audio in a numerical format that computational algorithms can more easily process.

More specifically, embeddings are dense vectors that characterize meaningful information about the objects that they encode. The most common kinds of embeddings are word and text embeddings.



### Word Embeddings

A word embedding is a vector that captures the semantic meaning of a word. Ideally, words that are semantically similar in natural language have embeddings that are close to each other in the encoded vector space, while words that are unrelated or opposite in meaning lie further apart. In other words, related words cluster together and unrelated words are far from each other.

```
conda create -n rag python=3.11
conda activate rag
python -m pip install spacy
python -m spacy download en_core_web_md
```

In [5]:
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

dog_embedding = nlp.vocab["dog"].vector
type(dog_embedding),  dog_embedding.shape,  dog_embedding[0:3]
# (numpy.ndarray,  (300,), array([-0.72483 ,  0.42538 ,  0.025489], dtype=float32))

(numpy.ndarray,
 (300,),
 array([-0.72483 ,  0.42538 ,  0.025489], dtype=float32))

In [6]:
def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [20]:
dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector
tasty_embedding = nlp.vocab["tasty"].vector
delicious_embedding = nlp.vocab["delicious"].vector
truck_embedding = nlp.vocab["truck"].vector

In [21]:
compute_cosine_similarity(dog_embedding, cat_embedding)

np.float32(1.0000001)

Note: a cosine similarity of 1.0 means the two vectors point in exactly the same direction, which is unexpected for two distinct words. It may be that this model assigns `dog` and `cat` the same vector; `np.array_equal(dog_embedding, cat_embedding)` would confirm.

In [22]:
compute_cosine_similarity(delicious_embedding, tasty_embedding)

np.float32(0.450864)

In [23]:
compute_cosine_similarity(apple_embedding, delicious_embedding)

np.float32(0.39558223)

In [24]:
compute_cosine_similarity(dog_embedding, apple_embedding)

np.float32(0.2334378)

In [25]:
compute_cosine_similarity(truck_embedding, delicious_embedding)

np.float32(0.036047027)
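
The pairwise comparisons above can be generalized into a ranking: score every candidate against one query vector and sort. This is a sketch under stated assumptions. `rank_by_similarity` is a hypothetical helper, and the 3-dimensional vectors are made up so the example runs without the spaCy model.

```python
import numpy as np

def rank_by_similarity(query: np.ndarray, candidates: dict) -> list:
    """Return (name, cosine similarity) pairs, most similar first."""
    scores = {
        name: float((query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for name, vec in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy vectors invented for illustration; real embeddings have 300 dimensions
toy_vectors = {
    "truck": np.array([0.0, 0.2, 0.9]),
    "cat": np.array([0.9, 0.1, 0.0]),
    "apple": np.array([0.1, 0.9, 0.1]),
}
query = np.array([1.0, 0.2, 0.0])  # stands in for a "dog"-like vector

ranking = rank_by_similarity(query, toy_vectors)
ranked_names = [name for name, _ in ranking]
```

With the real model, the same helper should work by passing `nlp.vocab["dog"].vector` as the query and a dictionary of the other embeddings as candidates.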

### Text Embeddings

Text embeddings encode information about sentences and documents, not just individual words, into vectors. This allows you to compare larger bodies of text to each other just like you did with word vectors. Because they summarize a whole sentence or document rather than a single word, text embeddings carry more information than any individual word embedding.
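
As a rough baseline, a text embedding can be built by averaging the word vectors of a text. The sketch below assumes toy 3-dimensional word vectors invented for illustration, and the helpers `embed_text` and `cosine` are hypothetical names. Dedicated text-embedding models are more sophisticated than this.

```python
import numpy as np

# Toy word vectors, invented for illustration only
toy_vectors = {
    "dogs": np.array([0.9, 0.1, 0.0]),
    "chase": np.array([0.5, 0.5, 0.1]),
    "cats": np.array([0.8, 0.2, 0.0]),
    "trucks": np.array([0.0, 0.2, 0.9]),
    "haul": np.array([0.1, 0.4, 0.7]),
    "cargo": np.array([0.0, 0.3, 0.8]),
}

def embed_text(text: str) -> np.ndarray:
    """Average the vectors of the known words in text (word order is ignored)."""
    words = [w for w in text.lower().split() if w in toy_vectors]
    if not words:
        raise ValueError("no known words in text")
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = embed_text("Dogs chase cats")
b = embed_text("Cats chase dogs")
c = embed_text("Trucks haul cargo")

same_words = cosine(a, b)       # same words in a different order
different_topic = cosine(a, c)  # unrelated vocabulary
```

Averaging cannot tell "Dogs chase cats" from "Cats chase dogs", because both texts contain the same words. That limitation is one reason to prefer a model trained to embed whole texts.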