# Vectors, Vectors, Vectors

As with many things in life, it all boils down to linear algebra and a few non-linear functions.

Vector representations enable similarity calculations, which in turn support several applications: question answering, evaluation procedures, fetching related texts, etc. Because of this usefulness, we want efficient ways of 1) obtaining vector representations, 2) operating on them, and 3) storing them for later use.
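
To make the idea of a similarity calculation concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values chosen for illustration, not real embeddings, and `cosine_similarity` is a helper name introduced only for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings).
soup = [0.9, 0.1, 0.0]
stew = [0.8, 0.2, 0.1]
friend = [0.1, 0.9, 0.2]

# Vectors pointing in similar directions score higher.
print(cosine_similarity(soup, stew))
print(cosine_similarity(soup, friend))
```

In this toy setup, the first pair scores higher than the second because the two vectors point in nearly the same direction.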

In this notebook, we will discuss how to obtain embeddings from the OpenAI API and how to use a vector database to store and operate on vector representations.

In [1]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [2]:
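import os
from openai import OpenAI

# Requests are authenticated through the x-api-key header read from the
# API_GATEWAY_KEY environment variable; api_key is only a placeholder here.
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})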

Our sample phrases cover three topics: good food, friendship, and community.

In [4]:
phrases = [
    # Good food (10)
    "A warm meal turns a hard day into something you can survive.",
    "The best seasoning is the feeling that someone made this for you.",
    "Good cooking is attention made edible.",
    "A shared table is a small, daily celebration.",
    "Comfort has a flavor, and it usually tastes like home.",
    "Food is memory you can hold in your hands.",
    "The first bite can be a doorway back to your happiest place.",
    "When the soup is right, the whole world softens a little.",
    "The simplest dish becomes extraordinary when it’s made with care.",
    "A good meal doesn’t just feed you—it reassures you.",

    # Friendship (10)
    "Real friends don’t rescue you from storms; they sit beside you in the rain.",
    "Friendship is the quiet agreement to keep showing up.",
    "A friend is someone who makes your good news bigger and your bad news smaller.",
    "You can measure trust by how safe silence feels.",
    "The truest kindness is being understood without needing to perform.",
    "Friendship is laughing at the same nonsense for years and never getting tired of it.",
    "The best friendships feel like exhaling.",
    "A friend is a mirror that reflects your worth on days you forget it.",
    "Some people become home, even if you never share an address.",
    "Good friends help you carry life without making you feel heavy.",

    # Community (10)
    "Community is strangers becoming neighbors one small favor at a time.",
    "Belonging isn’t found—it’s built.",
    "A community is a chorus: different voices, one song.",
    "When one person is lifted, the whole street rises a little.",
    "Care is the infrastructure that holds a place together.",
    "A shared future starts with shared responsibility.",
    "The strongest neighborhoods are stitched together by everyday generosity.",
    "Community is the art of making room for one more.",
    "We are safer when we are seen.",
    "Together is a direction, not just a feeling.",
]

We have 30 phrases in total, 10 per topic:

In [5]:
len(phrases)

30

To obtain embeddings, we will use the `text-embedding-3-small` model. By default, this model generates a 1536-dimensional vector for each input text.

The documentation for the embeddings API can be found [here](https://platform.openai.com/docs/guides/embeddings).

# A Simple Input

We start with a simple example using the first document/phrase:

In [7]:
phrases[0]

'A warm meal turns a hard day into something you can survive.'

In [8]:
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

response = client.embeddings.create(
    input = phrases[0], 
    model = "text-embedding-3-small"
)

In [9]:
response.data

[Embedding(embedding=[-0.00527027016505599, -0.0011117729591205716, -0.025171248242259026, 0.007839486934244633, 0.009685206227004528, -0.05623336508870125, 0.027242055162787437, 0.014276998117566109, 0.05168015882372856, -0.015524627640843391, 0.0019389706430956721, 0.04717840254306793, -0.028965584933757782, 0.02976303920149803, 0.049184899777173996, 0.03408472612500191, 0.0686839371919632, 0.05034249648451805, -0.018907375633716583, 0.0421878844499588, 0.003209108952432871, -0.034136172384023666, -0.020180730149149895, -0.004717197269201279, 0.004996949341148138, 0.03639991208910942, 0.0034631367307156324, 0.01556321419775486, 0.028116682544350624, -0.022328710183501244, -0.010257572866976261, -0.030251801013946533, -0.03058621659874916, -0.049159176647663116, 0.0412103608250618, 0.002032221294939518, -0.006353907287120819, 0.036811500787734985, -0.003366670338436961, 0.020129280164837837, 0.0032830664422363043, 0.0010884603252634406, 0.05137146636843681, 0.0728512778878212, 0.00609

# Loop through Simple Inputs

We will now try the example found in the [API documentation](https://platform.openai.com/docs/guides/embeddings/embeddings#obtaining-the-embeddings), which simply loops through the documents, calling the API each time. The function below first performs a simple cleanup (removes line breaks), then requests the embeddings.

In [10]:
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

Using Python's list comprehension syntax, we can run the function for each of our example phrases.

In [11]:
embeddings = [get_embedding(doc) for doc in phrases]


The statement above is roughly equivalent to:

```python
embeddings = []
for doc in phrases:
    doc_emb = get_embedding(doc)
    embeddings.append(doc_emb)
```

In [12]:
embeddings

[[-0.0052934628911316395,
  -0.001104680821299553,
  -0.025148773565888405,
  0.007814772427082062,
  0.009712185710668564,
  -0.05613772198557854,
  0.02727130427956581,
  0.014278843067586422,
  0.05171256512403488,
  -0.015552361495792866,
  0.0019263573922216892,
  0.04713304713368416,
  -0.02894360013306141,
  0.029766885563731194,
  0.04921698570251465,
  0.03411485627293587,
  0.06869281083345413,
  0.05032327398657799,
  -0.018884090706706047,
  0.04219333827495575,
  0.0031725403387099504,
  -0.03411485627293587,
  -0.02017047442495823,
  -0.004688863176852465,
  0.005007242783904076,
  0.036430343985557556,
  0.0034539364278316498,
  0.015565224923193455,
  0.028120316565036774,
  -0.022331595420837402,
  -0.010278194211423397,
  -0.03022998385131359,
  -0.030590170994400978,
  -0.049165528267621994,
  0.041138503700494766,
  0.0020324839279055595,
  -0.006377240177243948,
  0.0368419885635376,
  -0.0033542418386787176,
  0.020106155425310135,
  0.0032770587131381035,
  0.001

# Sending Lists of Inputs to the API

We can also send a collection of inputs to the API:

In [13]:
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

response = client.embeddings.create(
    input = phrases, 
    model = "text-embedding-3-small"
)
response.data

[Embedding(embedding=[-0.00527027016505599, -0.0011117729591205716, -0.025171248242259026, 0.007839486934244633, 0.009685206227004528, -0.05623336508870125, 0.027242055162787437, 0.014276998117566109, 0.05168015882372856, -0.015524627640843391, 0.0019389706430956721, 0.04717840254306793, -0.028965584933757782, 0.02976303920149803, 0.049184899777173996, 0.03408472612500191, 0.0686839371919632, 0.05034249648451805, -0.018907375633716583, 0.0421878844499588, 0.003209108952432871, -0.034136172384023666, -0.020180730149149895, -0.004717197269201279, 0.004996949341148138, 0.03639991208910942, 0.0034631367307156324, 0.01556321419775486, 0.028116682544350624, -0.022328710183501244, -0.010257572866976261, -0.030251801013946533, -0.03058621659874916, -0.049159176647663116, 0.0412103608250618, 0.002032221294939518, -0.006353907287120819, 0.036811500787734985, -0.003366670338436961, 0.020129280164837837, 0.0032830664422363043, 0.0010884603252634406, 0.05137146636843681, 0.0728512778878212, 0.00609
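
For larger collections, it may be preferable to split the inputs across several smaller requests instead of sending everything at once. The sketch below shows one way to do that, assuming a hypothetical `embed_batch` callable that takes a list of texts and returns one vector per text in the same order (for instance, a thin wrapper around `client.embeddings.create`). The batch size of 10 is arbitrary, and `fake_embed` is a stand-in for the API so that the example runs offline.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch, batch_size=10):
    """Embed `texts` batch by batch, preserving the input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

def fake_embed(batch):
    # Stand-in for the API (hypothetical): a one-dimensional "embedding"
    # holding the text length, so the example needs no network access.
    return [[float(len(text))] for text in batch]

texts = [f"phrase {i}" for i in range(25)]
vectors = embed_all(texts, fake_embed, batch_size=10)
print(len(vectors))  # 25
```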

# Vector DB

We can use a specialized database to store our embeddings, relate them to documents, and efficiently perform computations like cosine similarity.
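
Conceptually, a similarity query ranks the stored vectors by their closeness to a query vector. The sketch below does this by brute force over a small in-memory dictionary, using squared Euclidean distance. It only illustrates the idea and is not how Chroma DB is implemented; the ids, vectors, and helper names are hypothetical.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, store, k=2):
    """Return the k (distance, id) pairs closest to `query`, nearest first."""
    scored = ((squared_l2(query, vec), doc_id) for doc_id, vec in store.items())
    return heapq.nsmallest(k, scored)

# Toy store mapping ids to two-dimensional vectors (made-up values).
toy_store = {
    "id0": [1.0, 0.0],
    "id1": [0.8, 0.6],
    "id2": [0.0, 1.0],
}

for distance, doc_id in nearest([1.0, 0.1], toy_store, k=2):
    print(doc_id, round(distance, 3))
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of phrases but is the part a dedicated vector database is meant to handle efficiently.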

![](img/02_chroma.png)

The vector database that we will use for our experiments is Chroma DB, a simple vector database implementation that is commonly used for prototyping.

A few useful references are: 
- [ChromaDB Documentation](https://docs.trychroma.com/docs/overview/introduction).
- [ChromaDB Cookbook](https://cookbook.chromadb.dev/running/running-chroma/#chroma-cli).

Chroma can be run locally in memory, locally using file persistence, or using a Docker container.

## Running Chroma Locally in Memory

The simplest implementation is to run Chroma DB in memory without persistence.

In [18]:
import chromadb

chroma_client = chromadb.Client()

First, create a collection. A collection is a container that groups documents together, much as a table groups records together in a relational database.

In [19]:
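# get_or_create_collection returns the existing collection when this cell is re-run,
# which avoids "InternalError: Collection [nice_phrases] already exists".
collection = chroma_client.get_or_create_collection(name = "nice_phrases")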

Then, add documents to our collection. Each document will contain:

1. An identifier.
2. The phrase.
3. The embeddings.
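
These three pieces are passed to the collection as parallel lists, so position `i` in each list should describe the same document. A small sanity check along the lines below can catch mismatches before inserting. `check_records` is a hypothetical helper, the checks are ones we assume to be sensible rather than an exhaustive statement of what Chroma DB requires, and the toy values are placeholders.

```python
def check_records(ids, documents, embeddings):
    """Validate parallel lists before adding them to a collection."""
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError("ids, documents, and embeddings must have equal length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")
    return len(ids)

# Toy records (placeholder values, not real embeddings).
toy_docs = ["first phrase", "second phrase"]
toy_ids = [f"id{i}" for i in range(len(toy_docs))]
toy_vecs = [[0.1, 0.2], [0.3, 0.4]]

print(check_records(toy_ids, toy_docs, toy_vecs))  # 2
```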

In [20]:
embeddings = [item.embedding for item in response.data]
ids = [f"id{i}" for i in range(len(phrases))]

In [21]:
collection.add(embeddings = embeddings, 
               documents = phrases, 
               ids = ids)

Now, we can use Chroma DB's [`query`](https://docs.trychroma.com/docs/querying-collections/query-and-get) method to perform a query using similarity search. 

## Performing a Search Using Custom Embeddings

We could use a function such as the one below to provide our own embeddings of the query text.

In [22]:
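def query_chromadb(query, top_n = 2):
    # Embed the query with the same model used for the stored phrases.
    query_embedding = get_embedding(query)
    results = collection.query(query_embeddings = [query_embedding], n_results = top_n)
    # Results are nested per query; we sent a single query, so we take index 0.
    # Each tuple is (id, distance, text).
    return list(zip(results['ids'][0], results['distances'][0], results['documents'][0]))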

In [23]:
query = "What is good food?"

query_chromadb(query, top_n=3)

[('id9',
  1.0859720706939697,
  'A good meal doesn’t just feed you—it reassures you.'),
 ('id2', 1.1407593488693237, 'Good cooking is attention made edible.'),
 ('id5', 1.1558701992034912, 'Food is memory you can hold in your hands.')]
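
The numeric value in each tuple is a distance rather than a similarity score, so smaller values indicate closer matches. Assuming the collection uses squared Euclidean distance (to our knowledge Chroma's default, but worth confirming in the documentation for your version) and that the embeddings have unit length, distance and cosine similarity are related by $d = 2 - 2\cos\theta$. The sketch below checks this identity on made-up unit vectors; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def distance_to_cosine(distance):
    # Valid only for unit-length vectors and squared Euclidean distance (assumed).
    return 1 - distance / 2

# Made-up vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
print(math.isclose(squared_l2(a, b), 2 - 2 * dot(a, b), abs_tol=1e-12))  # True

# Under the same assumptions, the top distance above maps to roughly 0.457.
print(distance_to_cosine(1.0859720706939697))
```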

## Performing a Search Using Embedding Function

Alternatively, we can define the embedding function when we create the collection.

If needed, list the existing collections and delete the one whose name you want to reuse:

In [24]:
chroma_client.list_collections()

[Collection(name=nice_phrases)]

In [25]:
chroma_client.delete_collection("nice_phrases")

We can now reuse the collection name, this time with an OpenAI embedding function. Notice that we pass the `api_key` parameter explicitly, because Chroma DB and the OpenAI library look for the API key under different environment variable names.

In [26]:
import os
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = chroma_client.create_collection(
    name = "nice_phrases",
    embedding_function = OpenAIEmbeddingFunction(
        api_key = "any value",
        model_name="text-embedding-3-small",
        api_base='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
        default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')}
))
collection.add(embeddings = embeddings, 
               documents = phrases, 
               ids = ids)

With the embedding function, we can now perform the query:

In [27]:
collection.query(
    query_texts = ["What is a friend?", "What is good food?"], 
    n_results = 2
)

{'ids': [['id12', 'id17'], ['id9', 'id2']],
 'embeddings': None,
 'documents': [['A friend is someone who makes your good news bigger and your bad news smaller.',
   'A friend is a mirror that reflects your worth on days you forget it.'],
  ['A good meal doesn’t just feed you—it reassures you.',
   'Good cooking is attention made edible.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None], [None, None]],
 'distances': [[0.8235226273536682, 1.0530608892440796],
  [1.0859720706939697, 1.1407593488693237]]}
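
The result is a dictionary of nested lists, with one inner list per query text, in the order in which the queries were sent. Below is a minimal sketch of regrouping it into per-query rows. It uses a trimmed copy of the output above as literal data so that it runs without a database; `rows_per_query` is a hypothetical helper.

```python
# Trimmed copy of the query output shown above.
query_results = {
    "ids": [["id12", "id17"], ["id9", "id2"]],
    "documents": [
        ["A friend is someone who makes your good news bigger and your bad news smaller.",
         "A friend is a mirror that reflects your worth on days you forget it."],
        ["A good meal doesn’t just feed you—it reassures you.",
         "Good cooking is attention made edible."],
    ],
    "distances": [[0.8235226273536682, 1.0530608892440796],
                  [1.0859720706939697, 1.1407593488693237]],
}
query_texts = ["What is a friend?", "What is good food?"]

def rows_per_query(query_texts, results):
    """Map each query text to a list of (id, distance, document) tuples."""
    rows = {}
    for i, query in enumerate(query_texts):
        rows[query] = list(zip(results["ids"][i],
                               results["distances"][i],
                               results["documents"][i]))
    return rows

for query, hits in rows_per_query(query_texts, query_results).items():
    print(query)
    for doc_id, distance, text in hits:
        print(f"  {doc_id}  {distance:.3f}  {text}")
```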