# **Vector databases for embedding systems**

<img src = './images/limits-of-current-approach.png' width=50% height=50%>

So far, we've created embeddings using the OpenAI API and stored them in-memory. This approach has three limitations:

* We loaded all the embeddings into memory (1536 floats ~ 13kB/embedding), which becomes impractical for hundreds of thousands or millions of embeddings. 
* We recalculated the embeddings with every query rather than storing them for later use. 
* We computed cosine distances for every embedded document and sorted the results; both steps get slower as the number of documents grows (the distance computation scales linearly, the sort slightly worse than linearly). 
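To make the memory limitation concrete, here is a rough back-of-the-envelope sketch. It assumes each of the 1536 dimensions is stored as an 8-byte float, which gives about 12 kB per embedding (close to the figure above); the helper name `embedding_memory_bytes` is made up for this illustration, and container overhead is ignored.

```python
# Back-of-the-envelope memory estimate for embeddings held in memory.
# Assumption: each of the 1536 dimensions is stored as an 8-byte float.
DIMENSIONS = 1536
BYTES_PER_FLOAT = 8  # assumed; 4-byte floats would halve every figure


def embedding_memory_bytes(n_embeddings: int) -> int:
    """Raw bytes needed to hold n_embeddings vectors, ignoring container overhead."""
    return n_embeddings * DIMENSIONS * BYTES_PER_FLOAT


for n in (1, 100_000, 1_000_000):
    gigabytes = embedding_memory_bytes(n) / 1e9
    print(f"{n:>9,} embeddings: {gigabytes:.3f} GB")
```

Under these assumptions, one million embeddings already need roughly 12 GB of raw storage before any query is run.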

To enable embeddings applications with larger datasets in production, we'll need a better solution: __vector databases__!

## **Vector databases**

Here's a typical embeddings application: 

<img src = './images/embedding-app-arch.png' width=50% height=50%>

* Embedded documents are _stored_ in and _queried_ from the __vector database__

- The documents to query are embedded and stored in the vector database. 
- A query is sent from the application interface, embedded, and used to query the embeddings in the database. This query can be a semantic search query or data to base recommendations on. 
- Finally, the results are returned to the user via the application interface. 

Because the embedded documents are stored in the vector database, they don't have to be created with each query or held in memory. Additionally, due to the architecture of the database, the similarity calculation can be computed much more efficiently.
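For contrast, the in-memory approach that a vector database replaces can be sketched as a linear scan. This is a minimal pure-Python illustration with made-up two-dimensional vectors; the function names are hypothetical, and it is not how a vector database works internally (those typically rely on an index rather than comparing against every vector).

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query_vector, stored_vectors, n_results=3):
    """Return the indices of the n_results closest stored vectors."""
    # One distance per stored vector: the work grows with the dataset size
    scored = [
        (cosine_distance(query_vector, vector), index)
        for index, vector in enumerate(stored_vectors)
    ]
    scored.sort()  # sorting every score is the second expensive step
    return [index for _, index in scored[:n_results]]


# Toy data: three stored "embeddings" and one query vector
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
nearest = brute_force_search([1.0, 0.1], stored, n_results=2)
print(nearest)
```

Every query touches every stored vector, which is exactly the cost that becomes a problem at scale.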

### **NoSQL databases vs SQL databases**

Most vector databases are NoSQL databases, in contrast to conventional relational (SQL) databases.

<div style="display: flex;">
    <div style="flex: 50%; padding: 5px; border-right: 2px solid DodgerBlue;">
        **NoSQL Database**  <br>
        <div style="text-align: center;">
            <img src='./images/nosql-db.png' width=75% height=80%>
        </div>
        NoSQL databases don't use tables
        <ul>
            <li>More flexible structure, which can allow for <i>faster querying</i></li>
            <li>Three examples are shown above: key:value, document, and graph databases</li>
        </ul>
    </div>
    <div style="flex: 50%; padding: 5px;">
        **SQL/Relational Database**  <br>
        <div style="text-align: center;">
            <img src='./images/sql-db.png' width=75% height=80%>
        </div>
        <ul>
            <li>Data is structured into tables, rows, and columns</li>
        </ul>
    </div>
</div>

### **The vector database landscape**

<img src='./images/vector-db-options.png' width=60% height=60%>

When deciding which database solution to go with, there are several factors to consider.

### **Which solution is best?**

<div style="display: flex;">
    <div style="flex: 50%; padding: 5px; border-right: 2px solid DodgerBlue;">
        <ul>
           <li><b>Database management</b></li>
           <ul>
               <li>Managed &#x2192; more expensive but lowers workload.</li>
               <li>Self-managed &#x2192; cheaper but requires time and expertise.</li>
            </ul>
        </ul>
        <ul>
            <li><b>Open source or commercial?</b></li>
            <ul>
               <li>Open source &#x2192; flexible and cost-effective if budgets are tight</li>
               <li>Commercial &#x2192; offers better support, more advanced features, and compliance</li>
            </ul>
        </ul>
    </div>
    <div style="flex: 50%; padding: 5px;">
        <ul>
            <li><b>Data models</b>: does the type of data lend itself to a particular database type?</li>
            <li><b>Specific features</b>: does your use case depend on specific functionality, like embedding and storing both text and images for a multi-modal application?</li>
        </ul>
        In this course, we'll be using Chroma, as it's open-source and quick to set up.
        <div style="text-align: center;">
            <img src='./images/chroma.png' width=50% height=50%>
        </div>
    </div>
</div>



## **Creating vector databases with ChromaDB**

ChromaDB has two modes:

* __Local mode__:
  * Great for development and prototyping. Everything runs on our local machine, inside Python.
* __Client/Server mode__:
  * Made for production. Requires running a separate process for the Chroma server.

We'll be using the local mode.

### **Connecting to the database**

To connect to and query the database, we need to create a client. We import `chromadb` and create a persistent client by calling the `PersistentClient()` function. A persistent client saves the database files to disk at the path specified.

```python
import chromadb

client = chromadb.PersistentClient(path='/path/to/save/to')
```

### **Creating a collection**

To add embeddings to the database, we must first create a collection. Collections are analogous to tables, and we can create as many as we need to store our data. To create a collection, we use the `.create_collection()` method, passing the name of the collection, which is used to reference it later, and the function for creating the embeddings; here, we specify the OpenAI embedding function and API key.

```python
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="<OPENAI_API_KEY>"
    )
)
```

### **Inspecting the collection**

The `list_collections` method lists all of the collections in the database, so we can verify our collection was created.

```python
client.list_collections()
```

Output: <br>
`[Collection(name=my_collection)]`

### **Inserting embeddings**

We are now ready to add embeddings into the collection. We can do so with the `collection.add` method.

* IDs must be provided
* Embeddings will be created by the collection.

Since the collection is already aware of the embedding function, it will embed the source texts automatically using the function specified. Most of the time, we'll insert multiple documents at once, which we can do by passing multiple ids and documents.

**Single Document**

```python
collection.add(ids=['my-doc'], documents=['This is the source text'])
```

**Multiple Documents**

```python
collection.add(ids=['my-doc1', 'my-doc2'], documents=['This is document 1', 'This is document 2'])
```
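Because IDs and documents are passed as two parallel lists, it can be worth checking them before calling `collection.add`. The helper below is a hypothetical pre-flight check written for this illustration, not part of Chroma; it assumes IDs should be non-empty strings, unique within a batch, and paired one-to-one with documents.

```python
def validate_batch(ids, documents):
    """Hypothetical pre-flight check for a batch; not part of Chroma."""
    if len(ids) != len(documents):
        raise ValueError(
            f"got {len(ids)} ids but {len(documents)} documents"
        )
    if not all(isinstance(item_id, str) and item_id for item_id in ids):
        raise ValueError("every ID must be a non-empty string")
    if len(set(ids)) != len(ids):
        raise ValueError("IDs must be unique within a batch")
    # Return the pairs so the caller can inspect what would be inserted
    return list(zip(ids, documents))


pairs = validate_batch(
    ['my-doc1', 'my-doc2'],
    ['This is document 1', 'This is document 2'],
)
print(pairs)
```

If the check passes, the same `ids` and `documents` lists can be handed to `collection.add` unchanged.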

### **Inspecting the collection**

After inserting documents, we can inspect the collection with two methods:

1. __`collection.count()` method__: Returns the total number of documents in the collection.

```python
collection.count()
```
Output: <br>
`3`

2. __`collection.peek()` method__: Returns the first __10__ items in the collection.

```python
collection.peek()
```
Output: <br>
<img src='./images/collection-peek.png' width=50% height=50%>

We can also retrieve particular items by their ID using the `.get()` method.

```python
collection.get(ids=['s59'])
```
Output: <br>
<img src='./images/retrieve-item.png' width=50% height=50%>

### **Netflix Dataset**
In the following exercises, we'll insert a dataset of Netflix titles into a Chroma database. For each title, we'll embed a source text including the title, description, and categories. 

<img src='./images/netflix-dataset.png' width=50% height=50%>
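As a small sketch of how such a source text can be assembled, the helper below formats one row into the three-line text. The function name `build_source_text` is made up for this illustration; the column names `title`, `type`, `description`, and `listed_in` are the ones used by the CSV-loading code later in this section, and the example row is the first document shown there.

```python
def build_source_text(row):
    """Format one dataset row (a dict) into the text that will be embedded."""
    return (
        f"Title: {row['title']} ({row['type']})\n"
        f"Description: {row['description']}\n"
        f"Categories: {row['listed_in']}"
    )


example_row = {
    "title": "Dick Johnson Is Dead",
    "type": "Movie",
    "description": (
        "As her father nears the end of his life, filmmaker Kirsten Johnson "
        "stages his death in inventive and comical ways to help them both "
        "face the inevitable."
    ),
    "listed_in": "Documentaries",
}

source_text = build_source_text(example_row)
print(source_text)
```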

While this is not a massive dataset, we must not forget that each of these texts is going to be sent to the OpenAI embedding endpoint and therefore cost money. Before inserting a sizable dataset into a collection, it's important to get an idea of the cost.

### **Cost Estimation**

* Embedding model (`text-embedding-3-small`) costs $0.00002/1k tokens ($0.02/1M tokens)

OpenAI lists the cost per thousand tokens on its model pricing page, which means we can calculate the total cost as

`cost = 0.00002 * total_tokens / 1000`

where `total_tokens` is the number of tokens across all documents. We can count tokens with the `tiktoken` library (to install it, run `pip install tiktoken`).

### **Estimating embedding cost**

Tiktoken can convert any text into tokens. 

First, we use the `encoding_for_model` function to get a token encoder for the embedding model we're using. To calculate the total number of tokens, we use a generator expression inside `sum`. It reads: for each text in `documents`, encode it using the encoder and take the length with `len` to obtain the number of tokens in the text; then `sum` the results. This is more concise than writing an explicit loop with a running total.

Finally, we calculate the price by multiplying `total_tokens` by `cost_per_1k_tokens` over `1000`, and print the result.

```python
import tiktoken

# Get the token encoder for the embedding model
enc = tiktoken.encoding_for_model('text-embedding-3-small')

# Count the tokens in every document and sum them
total_tokens = sum(len(enc.encode(text)) for text in documents)

cost_per_1k_tokens = 0.00002

print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)
```

This prints:
```
Total tokens: 444463
Cost: 0.00888926
```
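The arithmetic can also be wrapped in a small reusable function. This is a sketch under stated assumptions: `estimate_cost` is a name made up for this illustration, it takes pre-computed token counts (for example `len(enc.encode(text))` for each document) so that it runs without `tiktoken`, and the default price is the `text-embedding-3-small` figure quoted above, which may change over time.

```python
COST_PER_1K_TOKENS = 0.00002  # text-embedding-3-small price quoted above


def estimate_cost(token_counts, cost_per_1k_tokens=COST_PER_1K_TOKENS):
    """Return (total_tokens, estimated_cost_in_dollars) for a list of token counts."""
    total_tokens = sum(token_counts)
    return total_tokens, cost_per_1k_tokens * total_tokens / 1000


# Two hypothetical batches adding up to one million tokens
total, cost = estimate_cost([400_000, 600_000])
print(f"Total tokens: {total:,}")
print(f"Estimated cost: ${cost:.4f}")
```

One million tokens comes out at about $0.02, which matches the per-million price listed above.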

### Example

**Instructions**

* Create a persistent client to save the database files to disk; you can leave out the file path for these exercises.
* Create a database collection called `netflix_titles` that uses the OpenAI embedding function.
* List all of the collections in the database.

```python
import os                               # to read environment variables
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv          # to load the .env file

# Load the .env file
load_dotenv()

# Get the API key from the environment (populated from the .env file)
api_key = os.getenv("OPENAI_API_KEY")

# Create a persistent client
client = chromadb.PersistentClient(path="./datasets/")

# Create a netflix_titles collection using the OpenAI embedding function
collection = client.create_collection(
    name="netflix_titles",
    embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", api_key=api_key)
)

# List the collections
print(client.list_collections())
```

Now that we've created a database and collection to store the Netflix films and TV shows, we can begin embedding data.

Before embedding a large dataset, it's important to do a cost estimate to ensure you don't exceed any budget constraints. Because OpenAI models are priced by the number of input tokens, we'll use OpenAI's tiktoken library to count the number of tokens and convert them into a dollar cost.

You've been provided with document texts as `documents`, which have been extracted from `netflix_titles_1000.csv`. Here is the first document from `documents`:
```
Title: Dick Johnson Is Dead (Movie)
Description: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
Categories: Documentaries
```

For later use, you've also been provided with document IDs.

You'll now iterate over the list, encode each document, and count the total number of tokens. Finally, you'll use the model's pricing to convert this into a cost.


```python
import csv

ids = []
documents = []

# Build one ID and one source text per row of the CSV file
with open('./datasets/netflix_titles.csv') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    ids.append(row['show_id'])
    text = f"Title: {row['title']} ({row['type']})\nDescription: {row['description']}\nCategories: {row['listed_in']}"
    documents.append(text)

# Print the loaded documents
print(documents)
```

```python
import tiktoken

# Load the encoder for the OpenAI text-embedding-3-small model
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# Encode each text in documents and calculate the total tokens
total_tokens = sum(len(enc.encode(text)) for text in documents)

cost_per_1k_tokens = 0.00002

# Display number of tokens and cost
print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)
```

That means embedding the whole dataset once costs roughly $0.0089, i.e. less than one cent.

Time to add those Netflix films and TV shows to your collection.

__Instructions__

* Recreate your `netflix_titles` collection.
* Add the documents and their IDs to the collection.
* Print the number of documents in `collection` and the first ten items.

```python
import csv
import os                               # to read environment variables
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv          # to load the .env file

# Load the .env file
load_dotenv()

# Get the API key from the environment (populated from the .env file)
api_key = os.getenv("OPENAI_API_KEY")

# Create a persistent client
client = chromadb.PersistentClient(path="./datasets/")

ids = []
documents = []

with open('./datasets/netflix_titles.csv') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
    ids.append(row['show_id'])
    text = f"Title: {row['title']} ({row['type']})\nDescription: {row['description']}\nCategories: {row['listed_in']}"
    documents.append(text)

# Retrieve the netflix_titles collection, creating it if it doesn't exist yet
collection = client.get_or_create_collection(
  name="netflix_titles",
  embedding_function=OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small", 
    api_key=api_key)
)

# Add the documents and IDs to the collection
collection.add(ids=ids, documents=documents)

# Print the collection size and first ten items
print(f"No. of documents: {collection.count()}")
print(f"First ten documents: {collection.peek()}")

# Retrieve a single item by its ID
collection.get(ids=['s1001'])
```

## **Querying and updating the database**

### **Querying the database**

Similar to what we did manually in the previous chapter, we'll build a semantic search application, but this time, using a vector database. The approach is exactly the same: we have a query string and we want to find similar titles in our collection. 

Previously, we had to embed the query string to get a query vector, which was used to find similar embeddings in the dataset:

<img src="./images/previously.png" width=50% height=50%>

With Chroma, we'll let the collection do the embedding, so we can pass our query string directly and Chroma will take care of creating the embedding and performing the search:

<img src="./images/now-chroma.png" width=50% height=50%>

First, we need to retrieve our collection, which we can do with `client.get_collection()`, specifying the name of the collection to retrieve. Recall that when we created the collection, we specified the embedding function to use, and it's also really important to specify the same function when retrieving the collection. This way, Chroma will use the same embedding function to create the _query vector_.

```python
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.get_collection(
    name='netflix_titles_test1',
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",  # same model used to embed the documents
        api_key="<OPENAI_API_KEY>"
    )
)
```

### **Querying the collection**

To query the collection, we call `collection.query`, passing our query string to `query_texts`. Note that this parameter is _plural_, so even if we have a single query string, we pass a list. To specify how many items to retrieve, we can use the `n_results` parameter. 

```python
result = collection.query(
    query_texts=["movies where people sing a lot"],
    n_results=3
)
```

Let's run the code below and see the result:

```python
import os
from dotenv import load_dotenv
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Load environment variables from .env file
load_dotenv()

# Get API key from environment variable
api_key = os.getenv('OPENAI_API_KEY')

collection = client.get_collection(
    name='netflix_titles_test1',
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key=api_key
    )
)

result = collection.query(
    query_texts=["movies where people sing a lot"],
    n_results=3
)

# To display the output in a more readable format, we format the dictionary
# output with each key-value pair on a new line.
# If you want to see the raw result, you can do print(result) instead.
formatted_output = "{\n"
for key, value in result.items():
    if key == 'documents':
        # Format each document on a new line
        formatted_output += f"  '{key}': [\n"
        for doc in value[0]:
            formatted_output += f"    '{doc}',\n"
        formatted_output += "  ],\n"
    else:
        formatted_output += f"  '{key}': {value},\n"
formatted_output += "}"

print(formatted_output)
```

### **Query Result (dict)**

Let's break the output down.

`query()` returns a dictionary with the following keys:
- `ids`: The IDs of the returned items
- `embeddings`: The embeddings of the returned items
- `documents`: The source texts of the returned items
- `metadatas`: The metadata of the returned items
- `distances`: The distances between the query and the returned items
- `uris`: The URIs of the returned items
- `data`: The data of the returned items
- `included`: The names of the fields that are populated in the result, such as `documents`, `metadatas`, and `distances`


The `embeddings` entry is empty simply because Chroma doesn't return embeddings by default. Each of the populated entries has the same format; let's look at `ids`.

### **Query results (list of lists)**

`ids` contains a list of lists. This is because the `query` method accepts a list of query texts, so the results follow the same structure, even when we pass a single query text:

* The first inner list corresponds to the first query text.
* Multiple query texts return multiple lists, one per query text.

Within each inner list, the format mirrors the parameters of the `add()` method: the first ID corresponds to the first document, the first metadata entry, and the first distance, and so on:

 <img src='./images/query_result.png' width=50% height=50%>
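To make the list-of-lists layout concrete, the sketch below walks through a result dictionary and pairs each query with its hits. The `result` here is a hand-written stand-in with placeholder IDs, texts, and distances, not real output from `collection.query`, and `unpack_results` is a helper name made up for this illustration.

```python
# Hand-written stand-in for a query result; all values are placeholders
query_texts = ["first query", "second query"]
result = {
    "ids": [["id-a", "id-b"], ["id-c", "id-d"]],
    "documents": [["doc a", "doc b"], ["doc c", "doc d"]],
    "distances": [[0.10, 0.25], [0.05, 0.40]],
}


def unpack_results(query_texts, result):
    """Flatten the nested lists into (query, id, document, distance) rows."""
    rows = []
    # Outer zip: one inner list per query text
    for query, ids, docs, dists in zip(
        query_texts, result["ids"], result["documents"], result["distances"]
    ):
        # Inner zip: the i-th id, document, and distance belong together
        for item_id, doc, dist in zip(ids, docs, dists):
            rows.append((query, item_id, doc, dist))
    return rows


rows = unpack_results(query_texts, result)
for row in rows:
    print(row)
```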


### **Updating a collection**

Items in a collection can be updated with the `update` method. The syntax is similar to `collection.add()`; in this example, we'll update the texts for items `id-1` and `id-2`. 

* Include _only_ the fields to update; other fields will be unchanged
* Collection will automatically create embeddings

```python
collection.update(
    ids=["id-1", "id-2"],
    documents=["New document 1", "New document 2"],
)
```

Alternatively, if we're not sure whether the IDs are already present in the collection, we can use the `upsert` method. `upsert` will add the IDs to the collection if they aren't present, and update them if they are: a combination of the `update` and `add` methods.

```python
collection.upsert(
    ids=["id-1", "id-2"],
    documents=["New document 1", "New document 2"],
)
```
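The difference between the two methods can be illustrated with a plain dictionary standing in for a collection. This is only a sketch of the semantics described above: it ignores embeddings and metadata, the function names are made up, and having `update` silently skip unknown IDs is a choice made for this illustration rather than a statement about Chroma's exact behavior.

```python
def update(store, ids, documents):
    """Change only the items whose IDs already exist in the store."""
    for item_id, doc in zip(ids, documents):
        if item_id in store:
            store[item_id] = doc


def upsert(store, ids, documents):
    """Update existing IDs and add the ones that are missing."""
    for item_id, doc in zip(ids, documents):
        store[item_id] = doc


# Toy "collection" that only contains id-1
store = {"id-1": "Old document 1"}
new_ids = ["id-1", "id-2"]
new_docs = ["New document 1", "New document 2"]

update(store, new_ids, new_docs)
after_update = dict(store)   # id-2 is still absent

upsert(store, new_ids, new_docs)
after_upsert = dict(store)   # id-2 has now been added

print(after_update)
print(after_upsert)
```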

### **Deleting**

__Delete items from a collection__

```python
collection.delete(ids=["id-1", "id-2"])
```

__Delete all collections and items__
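* __Warning__: This will delete everything in the database! In recent Chroma versions, `reset()` is disabled unless the client was configured with `allow_reset=True`.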

```python
client.reset()
```

Now that you've created and populated the `netflix_titles_test1` collection, it's time to query it!

You'll use it to provide recommendations for films and TV shows about dogs to one of your colleagues who loves dogs!

You've also been provided with two new Netflix titles stored in `new_data`. 

You'll either add or update these IDs in the database depending on whether they're already present in the collection.

__Instructions__

* Retrieve the `netflix_titles_test1` collection, specifying the OpenAI embedding function so the query is embedded using the same function as the documents.
* Extract the IDs and documents from `new_data`, and use a single method to update them in the `netflix_titles_test1` collection if they already exist and add them if they don't.
* After you've added/updated the items, delete the item with ID `'s95'`.
* Query the collection for "films about dogs" and return three results.

```python
new_data= [{"id": "s1001", "document": "Title: Cats & Dogs (Movie)\nDescription: A look at the top-secret, high-tech espionage war going on between cats and dogs, of which their human owners are blissfully unaware."},
 {"id": "s6884", "document": 'Title: Goosebumps 2: Haunted Halloween (Movie)\nDescription: Three teens spend their Halloween trying to stop a magical book, which brings characters from the "Goosebumps" novels to life.\nCategories: Children & Family Movies, Comedies'}]

# Retrieve the netflix_titles_test1 collection
collection = client.get_collection(
  name="netflix_titles_test1",
  embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", api_key=api_key)
)


# Update or add the new documents
collection.upsert(
    ids=[doc['id'] for doc in new_data],
    documents=[doc['document'] for doc in new_data]
)

# Delete the item with ID "s95"
collection.delete(ids=['s95'])

# Query the collection for "films about dogs"
result = collection.query(
  query_texts=['films about dogs'],
  n_results=3
)

print(result)
```

## **Multiple queries and filtering**

### **Movie recommendations based on multiple datapoints**

In the previous chapter, we used embeddings to make recommendations based on multiple data points. Let's do the same with the Netflix dataset and Chroma. We'll recommend movies related to other titles that a user has seen. Let's assume a user has seen a horror film and a kid's TV show:

- Terrifier (id: 's8170')
- Strawberry Shortcake: Berry Bitty Adventures (id: 's8103')

It's an odd combination, but hopefully it will help differentiate the recommendations.

We'll use the texts of the reference items as queries. First, we use `collection.get` to retrieve both of our reference items. Notice that we only extract and store the documents from these items in `reference_texts`. Since `collection.query` supports multiple query texts, we can pass `reference_texts` directly; we'll ask for three results per query text.

```python
reference_ids = ['s8170', 's8103']

reference_texts = collection.get(ids=reference_ids)['documents']

result = collection.query(
    query_texts=reference_texts,
    n_results=3
)
```
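Because the reference texts are themselves stored in the collection, each one is likely to come back as its own closest match. One possible post-processing step, sketched below, drops the reference IDs and merges the per-query lists into a single ranking. The `result` dictionary is a hand-written stand-in with placeholder IDs (`p1`, `p2`, `p3`) and distances, and `merge_recommendations` is a name made up for this illustration.

```python
reference_ids = ['s8170', 's8103']

# Hand-written stand-in for a query result; IDs p1..p3 and distances are placeholders
result = {
    "ids": [["s8170", "p1", "p2"], ["s8103", "p2", "p3"]],
    "distances": [[0.0, 0.30, 0.45], [0.0, 0.35, 0.50]],
}


def merge_recommendations(result, reference_ids):
    """Drop the reference items and rank the rest by their smallest distance."""
    already_seen = set(reference_ids)
    best = {}
    for ids, dists in zip(result["ids"], result["distances"]):
        for item_id, dist in zip(ids, dists):
            if item_id in already_seen:
                continue  # the user has already watched this title
            # Keep the smallest distance if an item matches several queries
            if item_id not in best or dist < best[item_id]:
                best[item_id] = dist
    return sorted(best.items(), key=lambda pair: pair[1])


recommendations = merge_recommendations(result, reference_ids)
print(recommendations)
```

Ranking by the smallest distance is only one option; averaging distances across queries would favor titles that are close to all of the reference items.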

```python
collection.get(ids=['s8103'])
```

```python
reference_ids = ['s8170', 's8103']

reference_texts = collection.get(ids=reference_ids)['documents']

result = collection.query(
    query_texts=reference_texts,
    n_results=3
)

print(result)
```